McGRAW-HILL PUBLICATIONS IN PSYCHOLOGY 
CLIFFORD T. MORGAN, CONSULTING EDITOR


FUNDAMENTAL STATISTICS 
IN PSYCHOLOGY AND EDUCATION 


McGraw-Hill Publications in Psychology
CLIFFORD T. MORGAN
CONSULTING EDITOR


Barker, Kounin, and Wright—CHILD BEHAVIOR AND DEVELOPMENT
Brown—PSYCHOLOGY AND THE SOCIAL ORDER
Brown—THE PSYCHODYNAMICS OF ABNORMAL BEHAVIOR
Cattell—PERSONALITY
Cole—GENERAL PSYCHOLOGY
Crafts, Schneirla, Robinson, and Gilbert—RECENT EXPERIMENTS IN PSYCHOLOGY
Davis—PSYCHOLOGY OF LEARNING
Dorcus and Jones—HANDBOOK OF EMPLOYEE SELECTION
Dunlap—RELIGION: ITS FUNCTIONS IN HUMAN LIFE
Ghiselli and Brown—PERSONNEL AND INDUSTRIAL PSYCHOLOGY
Gray—PSYCHOLOGY IN HUMAN AFFAIRS
Guilford—FUNDAMENTAL STATISTICS IN PSYCHOLOGY AND EDUCATION
Guilford—PSYCHOMETRIC METHODS
Hurlock—ADOLESCENT DEVELOPMENT
Hurlock—CHILD DEVELOPMENT
Johnson—ESSENTIALS OF PSYCHOLOGY
Krech and Crutchfield—THEORY AND PROBLEMS OF SOCIAL PSYCHOLOGY
Lewin—A DYNAMIC THEORY OF PERSONALITY
Lewin—PRINCIPLES OF TOPOLOGICAL PSYCHOLOGY
Maier—FRUSTRATION
Maier and Schneirla—PRINCIPLES OF ANIMAL PSYCHOLOGY
Miller—EXPERIMENTS IN SOCIAL PROCESS
Moore—PSYCHOLOGY FOR BUSINESS AND INDUSTRY
Morgan and Stellar—PHYSIOLOGICAL PSYCHOLOGY
Page—ABNORMAL PSYCHOLOGY
Pillsbury—AN ELEMENTARY PSYCHOLOGY OF THE ABNORMAL
Reymert—FEELINGS AND EMOTIONS
Richards—MODERN CLINICAL PSYCHOLOGY
Seashore—PSYCHOLOGY OF MUSIC
Seward—SEX AND THE SOCIAL ORDER
Stagner—PSYCHOLOGY OF PERSONALITY
Wallin—PERSONALITY MALADJUSTMENTS AND MENTAL HYGIENE


John F. Dashiell was Consulting Editor of this series from its
inception in 1931 until January 1, 1950.


Fundamental Statistics
in
Psychology and Education


BY 


J. P. GUILFORD 
Professor of Psychology, University of Southern California 




SECOND EDITION
SECOND IMPRESSION


McGRAW-HILL BOOK COMPANY, Inc. 
NEW YORK TORONTO LONDON 
1950


FUNDAMENTAL STATISTICS IN PSYCHOLOGY AND EDUCATION 


Copyright, 1942, 1950, by the McGraw-Hill Book Company, Inc. Printed in
the United States of America. All rights reserved. This book, or parts
thereof, may not be reproduced in any form without permission of the publishers.




THE MAPLE PRESS COMPANY, YORK, PA. 


Keep close to experience; add as little of 
your own as possible; if you have to add 
something, be mindful to give an account 
of every step you take.—F. M. URBAN. 




PREFACE TO THE SECOND EDITION 


Seven eventful years have elapsed since the appearance of the first
edition of this volume. So extensive have been the changes in research 
and instruction and so rapid have been the developments in statistical 
theory and method that the volume has been virtually rewritten. With 
greater recognition of the fact that a textbook in this field is inevitably 
forced into use as a handbook, many additional topics and features have 
been added. Many users of the first edition in classroom instruction have 
kindly made helpful suggestions. No topic was recommended for 
omission, but the author was urged to include many new ones. The 
consequence is a much enlarged volume. The emphasis remains upon 
applications rather than upon mathematical statistics.

Among the topics receiving greater attention are those concerned with
categorical data, sampling statistics, analysis of variance, prediction,
reliability and validity of tests, and scaling procedures. Methods for
tabular and graphic treatment of categorical data are introduced. Standard
errors of statistics obtained under varieties of sampling conditions
are provided. More attention is given to the logic of statistical inferences
and to small-sample statistics. Distributions of t, of chi-square, and of the
F ratio are illustrated. A two-way classification problem in analysis of
variance is described and illustrated. New methods of prediction of
attributes are presented and also procedures for evaluating predictions,
with tests of significance. Principles of multiple prediction are more
clearly brought out and alternative procedures presented. Attention is
given to the variances and correlations of weighted and unweighted
[...]. [...] in the rationale underlying test procedures are
reflected in a systematic manner. The different meanings and types of
reliability and validity are delineated. Factor theory is now regarded as
essential for the understanding of such phenomena as intercorrelation,
[...] prediction, and correction for attenuation. A brief introduction
is given to factor theory. A description of factor-analysis
procedures, however, is much too demanding of space for inclusion in a
volume of this character. Scaling procedures have been moved to a final
chapter, and the method of paired comparisons has been included.
[...] greater care has been taken to attempt to give the
mathematically unsophisticated student an appreciation of the underlying
logic of statistics. Assumptions underlying methods are rather
fully mentioned. An appendix has been added in which many simple, 
yet uncommon, proofs and derivations are presented. The limitations 
and peculiarities of each statistic are emphasized. It is hoped that all 
these measures will help to guide the reader in the proper use of statistics 
and in the avoidance of blunders. Many interrelations among different 
statistics, heretofore usually unrealized or ignored, have been pointed out. 
This should bring out for the student a much more coherent picture than 
before. Many points are liberally repeated in varied form, for emphasis 
and for evaluative reasons. 

A slight rearrangement of the order of chapters now places all the 
material that is likely to be needed in a first-semester course in statistics 
first in the volume. This material includes Chaps. 1 through 8 and parts 
of 9. The remainder includes more than enough material for a second 
semester’s work. 

The author is much indebted to the many users of the first edition who 
responded to requests for criticisms. Among these must be mentioned Dr. 
William B. Michael and Dr. Harrison F. Heath, who made numerous useful 
suggestions. Dr. Michael has read a number of the chapters in manuscript
form. To the Army Air Forces, from which source many a useful
illustration has been drawn, the author owes very much. It has been
impossible to mention by name all the individuals who originated ideas (which
have been borrowed to great advantage) as a part of their more or less 
anonymous contribution to the AAF Aviation Psychology Research Pro- 
gram. The author feels sure that none of the information revealed from 
that source will reflect anything but credit upon the AAF’s enlightened 


personnel-research program during the Second World War. Acknowledgments
for the use of other specific materials are given at appropriate
places. The author is very much indebted to his wife, Ruth B. Guilford,
and to Mrs. William W. Burke for considerable editorial assistance in the
preparation of the manuscript.

J. P. GUILFORD
BEVERLY HILLS, CALIF.
December, 1949


PREFACE TO THE FIRST EDITION 


Since the publication of Psychometric Methods six years ago, the author
has sensed a growing need for a supplementation in the form of a textbook
on statistical method. This is so because the earlier volume emphasized
methods of measurement rather than statistics and because, even in the
short span of years since its publication, some marked changes of emphasis
and some innovations have occurred within psychological and educational
statistics. The present volume therefore attempts to conserve what is
most useful among the old and to introduce as much of the new as seems to
have contemporary and future value.

The treatment presupposes no previous study of statistics and so strives
to provide for the student a simple introduction. The more fundamental
and useful procedures are outlined step by step and are fully illustrated. 
The selection of what seems most useful to the present student and 
investigator has been aided by the author’s experience in guiding students 
in research and in serving as director of the Bureau of Instructional 
Research at the University of Nebraska. His experiences with teaching 
the subject over a period of years have dictated the mode of presentation 
of statistical ideas and methods, but he realizes that sometimes what 
seem the best modes of presentation are none too good and that much 
teaching remains to be done. As a textbook, the volume will serve either 
for a one-semester introductory course or for a full-year course. 

Among the innovations in this type of text will be found several features. 
Some of the graphic devices for representing data are new to textbooks. 
The treatment of centile norms and of profiles based upon centiles is, the 
author believes, an improvement. A C-scale procedure for normalizing 
and standardizing scores is proposed and described, along with the tra- 
ditional T-scaling method. The growing emphasis upon sampling and 
drawing inferences about populations from sample statistics is reflected.
Small-sample statistics, including the t ratio and Student’s distribution, are
given extensive application. An introduction to analysis of variance
is provided, and a novel, pedagogically simple derivation of the
analysis-of-variance principle is presented. In a chapter quite new to this type of
text, entitled Testing Hypotheses, much of Fisher’s work is reflected, and
chi square is given prominence. In another new chapter, entitled Predictions
and Errors of Prediction, some new devices of practical importance
are introduced. In this chapter and in others, much attention is given to
enumeration data and the statistics of attributes, a field that is growing in 
importance in the social sciences in general. A treatment of factor 
analysis has been omitted for the reasons that this subject cannot any 
longer be adequately presented in the space that a text of these propor- 
tions would permit, and its study and mastery extend well beyond the 
student's first year of statistics. Even the final chapter on Mental Tests 
had to be treated rather sketchily in order to remain within reasonable 
bounds of space allotted to the volume. In general, there was recognition
of the limitations, self-imposed, where references to more advanced treat- 
ments were regretfully made in order to stay within the bounds of a funda- 
mental statistics. 

The author gladly expresses acknowledgments and thanks to Prof. 
Harry Helson, who read and criticized three of the chapters. To H. M. 
Cox, with whom the author was associated in the Bureau of Instructional 
Research, he owes much for certain ideas regarding ways of presentation 
of data and concerning the selection of useful methods. To his wife, 
Ruth B. Guilford, the author is, as always, most indebted for constant 
help in the preparation of the manuscript. To publishers and authors 
who have generously permitted the reproduction or use of material he is 
grateful. These and other contributions are acknowledged specifically at 
various places in the volume. To Prof. R. A. Fisher and to Messrs. 
Oliver & Boyd of Edinburgh the author is indebted for permission to 
reprint Table E from their book Statistical Methods for Research Workers,
8th ed., 1942.


J. P. GUILFORD
SANTA ANA, CALIF.
September, 1942


CONTENTS 


PREFACE TO THE SECOND EDITION
PREFACE TO THE FIRST EDITION

1. INTRODUCTION FOR STUDENTS

2. COUNTING AND MEASURING
    [...]
    Measurements
    Exercises

3. FREQUENCY DISTRIBUTIONS
    The Class Interval: Its Limits and Frequencies
    Graphic Representation of Frequency Distributions
    Exercises

4. MEASURES OF CENTRAL TENDENCY
    The Arithmetic Mean
    The Median
    The Mode
    When to Employ the Mean, Median, and Mode
    Means in Some Special Situations
    Exercises

5. MEASURES OF VARIABILITY
    The Total Range
    The Semi-interquartile Range, Q
    The Average Deviation
    The Standard Deviation
    Descriptive Use of Statistics
    Uses and Interrelationships of Different Measures of Dispersion
    The Coefficient of Variation
    Exercises

6. CUMULATIVE DISTRIBUTIONS AND NORMS
    Cumulative Frequencies and Cumulative Distribution Curves
    Centile Norms
    Exercises

7. THE NORMAL DISTRIBUTION CURVE
    The Nature of the Normal Curve
    Areas under the Normal Curve
    [...]
    Exercises

8. CORRELATION
    The Meaning of Correlation
    How to Compute a Coefficient of Correlation
    Interpretations of a Coefficient of Correlation
    Graphic Representations of Correlations
    Assumptions Underlying the Product-moment Correlation
    Exercises

9. THE RELIABILITY AND SIGNIFICANCE OF STATISTICS
    Some Principles of Sampling
    The Reliability of Averages
    The Reliability of Other Statistics
    The Reliability of Differences
    Small-sample Statistics
    Exercises

10. INTRODUCTION TO ANALYSIS OF VARIANCE
    Analysis in a One-way Classification Problem
    Analysis in a Two-way Classification Problem
    An Evaluation of Analysis of Variance
    Exercises

11. TESTING HYPOTHESES
    [...]
    Chi Square
    Exercises

12. TEST SCALES AND NORMS
    Standard Scores
    The T Scale and T Scaling of Tests
    The C Scale and C Scaling
    Some Norm and Profile Suggestions
    Exercises

13. SPECIAL CORRELATIONS, METHODS, AND PROBLEMS
    Spearman’s Rank-difference Correlation Method
    The Correlation Ratio
    The Biserial Coefficient of Correlation
    Point-biserial Correlation
    Tetrachoric Correlation
    The Phi Coefficient
    Partial Correlation
    Some Special Problems in Correlation
    Exercises

14. PREDICTION OF ATTRIBUTES
    Predicting Attributes from Other Attributes
    Predicting Attributes from Measurements
    Exercises

15. [...]
    Regression Equations
    The Correlation Coefficient and Accuracy of Prediction

16. [...]
    Multiple Correlation
    Some Principles of Multiple Correlation
    Multiple Correlation with More Than Three Variables
    Short Solutions for Regression Weights
    Combinations [...]
    Alternative Summarizing Methods
    Exercises

17. RELIABILITY OF MEASUREMENTS
    Reliability Theory
    Methods of Estimating [...]
    Internal-consistency Reliability
    Some Special Problems in Reliability
    Exercises

18. VALIDITY OF MEASUREMENTS
    Problems of Validity
    [...]
    Conditions upon Which [...]
    Exercises

19. SCALING PROCEDURES
    Scaling Test Items for Difficulty
    Measurements from Judgments of Rank Orders
    Scaling Judgments from Paired Comparisons
    Scaling Judgments in Successive Categories
    Transforming One Distribution into Terms of Another
    Exercises

APPENDIX
    A. Some Selected Mathematical Proofs and Derivations
    B. Tables
    C. A Glossary and Index of Symbols

AUTHOR INDEX




CHAPTER 1 


INTRODUCTION FOR STUDENTS 


This book was written for students. It was written by one who has
known many students [...] in psychology and education. He
has seen many generations of students prepare for their professions [...]
later at work in their professions. He has seen them as [...]
mastering the methods of their profession. He has seen them [...]
in the laboratory, in the clinic, and later in the industrial and [...]
laboratory of research and the personnel office. There has been [...]
opportunity in these experiences to see what the student needs in the way
of statistics, and why.

Why the Student Needs Statistics.—Most seasoned workers in psychology
or in education usually take the statistical methods for granted
as an essential part of their routine; some more so, and some less.
The initiate may at first react to statistics as a frightful bogie whose
mysteries loom forbidding before him, and he is likely to ask, “What is
the good of them, anyway?” This is particularly true of one who feels
he has always had trouble with numbers. Students who enter a first
course in statistical method in psychology or education, and probably
in all related social sciences, range all the way from those who find
mathematics in general easy and to their liking, to those at the other extreme
who say they have difficulty in adding two and two. Somehow, all of
these must acquire what they can of a subject for which they are so
unequally prepared.

Probably no other subject demonstrates so clearly that there are several
kinds of intelligence. No less a person intellectually than Charles Darwin
had trouble with statistics, as he is said to have frankly admitted. His
illustrious cousin, Sir Francis Galton, who is believed to have had an
IQ of about 200, and who had so much to do with introducing
statistics into psychology, had to turn some of his mathematical
problems over to others for aid.

There are different ways of understanding the same things. One
student will grasp the new ideas offered by statistics in the way that a
mathematician would understand them; another will appreciate the logical
rules of thinking and the concepts provided as aids in thinking; still
others will master rule-of-thumb [...] and be able to carry through [...]




ing and almost meaningless. Before we can see the forest as well as the 
trees, order must be given to the data. Statistics provide an unrivaled 
device for bringing order out of chaos; of seeing the general picture in 
one’s results. 

4. They enable us to draw general conclusions, and the process of extract- 
ing conclusions is carried out according to accepted rules. Furthermore, 
by means of statistical steps, we can say about how much faith should 
be placed in any conclusion and about how far we may extend our 
generalization. 

5. They enable us to make predictions of “how much” of a thing will
happen under conditions we know and have measured. For example, we
can predict the probable mark a freshman will earn in college algebra if
we know his score in a general scholastic-ability test, his score in a special 
algebra-aptitude test, his average mark in high-school mathematics, and 
perhaps the number of hours per week that he devotes to studying algebra. 
Our prediction may be somewhat in error because of other factors that 
we have not accounted for, but our statistical methods will also tell us 
about how much margin of error to allow in our predictions. Thus not 
only can we make predictions but we know how much faith to place in 
them. 

6. They enable us to analyze some of the causal factors out of complex and
otherwise bewildering events. It is generally true in the social sciences,
and in psychology and education in common with them, that any event 
or outcome is a resultant of numerous causal factors. The reasons why a 
man fails in his business or in his profession, for example, are varied and 
many. Causal factors are usually best uncovered and proved by means 
of experimental method. If it could be shown that, all other factors 
being held constant, certain business men fail to the extent that they 
possess some defect of personality “X,” then it is probable that X is a 
cause of failure in this type of business. Unfortunately for the social 
scientist, he cannot manage men and their affairs sufficiently to set up a 
good experiment of this type. The next best thing is to make a statistical
study, taking business men as we find them, working under conditions as 
they normally do. The life-insurance expert does the same kind of thing 
when he follows the trail of all possible factors that influence the length 
of life and determines how important they are. On the basis of these 
statistical findings, he can predict about how long an individual of a cer- 
tain type will probably live, and his insurance company can plan an 
insurance policy accordingly. Statistical methods are therefore often a 
necessary substitute for experiments. Even where experiments are possi- 
ble, the experimental data must ordinarily receive appropriate statistical 




treatment. Statistical methods are hence the constant companions of
experiments.
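The prediction described in point 5 can be sketched in a few lines. This is only an illustration under stated assumptions: it uses a single predictor (a scholastic-ability score) rather than the several predictors named in the text, the scores and marks are invented, and the function name `fit_line` is hypothetical. The margin of error is taken as the standard error of estimate, computed as the standard deviation of the marks times the square root of 1 - r squared.

```python
import math

def fit_line(xs, ys):
    """Return intercept a, slope b, and standard error of estimate
    for the least-squares line y = a + b*x (single predictor only)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx                      # regression weight
    a = mean_y - b * mean_x            # intercept
    r = sxy / math.sqrt(sxx * syy)     # product-moment correlation
    # Standard deviation of y (divisor n) shrunk by sqrt(1 - r^2).
    see = math.sqrt(syy / n) * math.sqrt(1 - r ** 2)
    return a, b, see

# Invented data: ability-test scores and algebra marks of five freshmen.
ability = [40, 50, 60, 70, 80]
marks = [60, 65, 75, 80, 90]

a, b, see = fit_line(ability, marks)
predicted = a + b * 65   # probable mark for a new student scoring 65
print(f"predicted mark {predicted:.2f}, margin of error about {see:.2f}")
```

With real data the margin would be far wider than in this tidy example, which is the point of reporting it alongside the prediction.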
What This Volume’s Treatment of Statistics Will Include.—For the
next few paragraphs we will take a hasty overview of the things to come.
The second chapter will give many more details of a general and preparatory
nature. Here we will try to look at the whole forest before we
enter it.
Descriptive and Sampling Statistics.—With some writers it is common
to make a broad distinction between descriptive and sampling statistics.
The distinction is real and should be realized, although we need not be
perpetually aware of it. It refers to two important uses of statistics.
In the first place, statistics are used to describe situations. Averages
tell us “how much” of certain quantities we have in a group of individuals
or in a group of observations. An average (e.g., arithmetic mean,
median, or mode) is a general-level concept. A single number tells how
high one group, or sample, stands on a certain scale as compared with
another.
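As a small illustration of an average as a general-level concept, the three averages named above can be computed for two groups and compared. The scores are invented and the helper name `general_level` is hypothetical; the sketch relies only on Python's standard `statistics` module.

```python
from statistics import mean, median, mode

def general_level(scores):
    """Summarize the general level of a group with three averages."""
    return {
        "mean": mean(scores),
        "median": median(scores),
        "mode": mode(scores),
    }

# Invented test scores for two groups of nine individuals each.
group_a = [3, 5, 5, 6, 7, 8, 8, 8, 9]
group_b = [2, 4, 4, 4, 5, 6, 6, 7, 7]

for name, scores in (("A", group_a), ("B", group_b)):
    print(name, general_level(scores))
```

Whichever average is chosen, a single number per group is enough to say that group A stands higher on this scale than group B.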

Other statistics tell us how much variability or scatter the individuals
of a group show. A statistic known as the standard deviation has been
the almost universal indicator of the amount of variability in a set of
individuals or observations, though there are others.

A coefficient of correlation describes the closeness of relationship between
two sets of measures of the same group of individuals or observations.
Most of science is concerned with finding out what things go with what,
and what things are independent of what. Correlation methods, in the
social sciences at least, are the most useful devices to answer these questions
of interrelationships. Averages, indices of dispersion and of correlation,
are the basic and chief descriptive statistics.

Sampling statistics have become increasingly important in recent years.
Their use is to tell us how well the statistics we obtain from measurements
of single samples probably represent the larger populations from
which the samples were drawn. Almost every statistic has a standard
error. A standard error is an index number that leads us to conclusions
about how far the statistic derived from the sample probably differs from
the value we would obtain if we had measured an entire population. A
population is a well-defined group of individuals or of observations. For
example, it could be one composed of Wistar-Institute albino rats between
the ages of 30 and 60 days. Or it could be all possible reproductions a
certain observer could make of a line 10 centimeters long under the same
conditions of rest, time of day, and method of reproduction, e.g., by
drawing a line with a pencil. A sample in either case would be a limited
number of observations out of the entire population. Arriving at conclusions
that can be generalized to all members of a population depends
upon reducing discrepancies between population values and sample values
to as small size as possible. This is probably best illustrated by
public-opinion polling, in which the margin of error of voting outcome
can be expressed in terms of a percentage of error.
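A standard error can be made concrete with two short functions. This is a sketch under assumptions, not the volume's own computing routine: the ten line reproductions are invented, the function names are hypothetical, the standard deviation uses the n - 1 divisor of `statistics.stdev`, and the poll figure uses the usual simple-random-sampling formula for a proportion.

```python
import math
from statistics import mean, stdev

def se_mean(sample):
    """Standard error of a sample mean: s / sqrt(n), with s on n - 1."""
    return stdev(sample) / math.sqrt(len(sample))

def se_proportion(p, n):
    """Standard error of a sample proportion under simple random sampling."""
    return math.sqrt(p * (1 - p) / n)

# Invented sample: ten reproductions (in cm) of a 10-centimeter line.
reproductions = [9, 10, 11, 10, 12, 8, 10, 10, 11, 9]
print(f"mean {mean(reproductions):.2f} cm, standard error {se_mean(reproductions):.3f} cm")

# Invented poll: 2,500 voters, half favoring a candidate.
print(f"poll standard error {se_proportion(0.5, 2500):.1%}")
```

The smaller the standard error, the closer the sample value probably lies to the value for the whole population, which is why enlarging the sample narrows a poll's margin of error.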

In connection with sampling statistics, there is much in this volume on
testing hypotheses. Scientific investigation proceeds from hypothesis to
hypothesis. There are numerous hypotheses but relatively few established
facts of a general nature. The sooner the research student realizes
this point, the better for his clear thinking. There are some investigators,
many of them well experienced, unfortunately, who do not make this
distinction between a hypothesis and a fact; they mistake hypotheses for
facts. For example, there is the hypothesis, stemming from Freudian
psychology, that children suffering from asthma are of the “oral-dependent”
type and that the breathing spasms are expressions of a cry for aid and
for love. The plausibility of the idea, and its apparent consistency with
other ideas, may be sufficient to lead many a clinical or psychiatric
investigator to act as if the problem were solved; as if the idea were a fact.
The properly skeptical investigator makes a study of a sample of asthmatic
children and of their nonasthmatic siblings to see whether there is
any greater incidence of dependency among the one group than among
the other. Probably the most fruitful scientific investigations, at least
those that lead to dependable answers, or those that go beyond the
exploratory stages, start by setting up a hypothesis, or several alternative
hypotheses. Conditions are then arranged in such a way that if the
results turn out one way, the hypothesis, or one of its alternatives, is
supported and other hypotheses are rendered doubtful. The results must
usually be cast in a statistical form which makes possible a decision
between hypotheses.

The simplest example of this is seen where we are studying the effects
of one thing on another. Let us suppose that it is the effect of benzedrine
on ability to reason. We restrict our problem to two alternative and
mutually exclusive hypotheses: (1) that benzedrine will affect thinking
output or efficiency and (2) that it will not. The first hypothesis can
be subdivided into two: that thinking will be facilitated and that thinking
will be hindered. The typical experimental operations would be
somewhat as follows, briefly described. We develop or adapt a test of
reasoning power. We select two groups of individuals of comparable age,
education, and IQ, both of the same sex. We determine that they are
equal on a preliminary trial of the reasoning test. We administer the
drug to one group and a control dose, or placebo, to the other. Neither
group knows which has taken the drug. We administer another form of
the reasoning test. We obtain two average scores and perhaps a
difference in a certain direction. The question is, does the obtained
difference support hypothesis (1) or hypothesis (2)? Could the difference
have occurred by chance? If not, it must have been due to the
drug, for so far as we know there is no other difference between the two
groups that could account for it. It requires a test of the statistical
significance of the difference to permit us to reject one hypothesis and accept
the other. Having rejected the idea that the difference was due to chance,
we may accept the idea that it was due to the drug. Without the
statistical test we would be rather helpless in reaching a dependable
answer.
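The kind of significance test this experiment calls for can be sketched as a t ratio for the difference between two independent means. Everything specific here is an assumption made for illustration: the scores are invented, the function name `pooled_t` is hypothetical, and the pooled-variance form of the ratio is one common choice rather than necessarily the procedure developed later in the volume. The critical value 2.228 is the tabled two-tailed 5 per cent point of t for 10 degrees of freedom.

```python
import math
from statistics import mean, variance

def pooled_t(group_1, group_2):
    """t ratio for two independent groups, pooling their variances."""
    n1, n2 = len(group_1), len(group_2)
    df = n1 + n2 - 2
    pooled_var = ((n1 - 1) * variance(group_1)
                  + (n2 - 1) * variance(group_2)) / df
    se_diff = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    return (mean(group_1) - mean(group_2)) / se_diff, df

# Invented reasoning-test scores after treatment.
drug = [14, 16, 15, 18, 17, 16]
placebo = [12, 13, 15, 11, 14, 13]

t, df = pooled_t(drug, placebo)
critical = 2.228  # tabled t at the .05 level, two-tailed, df = 10
verdict = "reject" if abs(t) > critical else "retain"
print(f"t = {t:.2f} with {df} df: {verdict} the chance hypothesis")
```

A ratio beyond the tabled value leads us to reject chance as the explanation; a smaller one leaves hypothesis (2) standing.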
The Normal Distribution Curve.—Every student is familiar with the
normal distribution curve; it is ubiquitous in psychological and educational
literature. There has been much use and abuse of it, and many
erroneous things are said about it. The curve itself is a mathematical
conception; it does not occur in nature; it is not a biological or a
psychological curve. [...] which we can apply to useful
purpose in many [...] applied statistics [... mathematics] must be
kept [...]. Many fruitful applications of the normal
distribution curve in psychology and education will be described in later
chapters. These applications are usually made without proof that human
[...] are normally distributed but with the assumption that they are
normally distributed in order that we may benefit from the use of the
mathematical properties of the normal curve. If there were knowledge
about distributions of human qualities to the contrary, we should, of
course, forego these applications. [...] and its properties
is therefore essential. [...] Chaps. 12 and 19.
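One of the mathematical properties that such applications borrow is the fixed proportion of area lying within a given distance of the mean. A minimal sketch, assuming the unit normal curve and using the standard library's `statistics.NormalDist`; the function name `area_within` is hypothetical.

```python
from statistics import NormalDist

def area_within(z):
    """Proportion of the unit normal curve between -z and +z."""
    unit_normal = NormalDist(mu=0.0, sigma=1.0)
    return unit_normal.cdf(z) - unit_normal.cdf(-z)

for z in (1.0, 1.96, 2.58):
    print(f"within {z} standard deviations: {area_within(z):.4f}")
```

These proportions hold for the mathematical curve; they describe a set of human measurements only to the extent that the assumption of normal distribution is tenable for it.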
Prediction Statistics.—Three chapters are organized under the
heading of “prediction.” Most textbooks of beginning psychology start
out by saying that it is the purpose of psychology to predict and control
human behavior. From that point on, not much more is said about
prediction. Dealing with the very complex and intricate set of phenomena
that the behavior of living organisms presents, and realizing the
limitations to [...], it is appropriate for us to be modest
on the subject. [...] guilty, however, about our failures
to make predictions comparable with those in the physical sciences to
the extent that we make realistic efforts to achieve the
predictions that are possible, nor should we disparage our accomplishments.
Prediction is actually made even when we do not
realize it. The vocational counselor who tells a client that he should
consider seriously vocations P, Q, and R and should shy away from vocations
V, U, and W is tacitly predicting success in the one group and failure
in the other. The clinician who diagnoses a person as having an anxiety
neurosis is saying that he expects of this individual certain behavior.
If he prescribes a certain program of therapy, he is predicting improvement
under that treatment versus lack of improvement if it is not [...].
The promotion of a child to the next higher grade is a prediction that he
will probably adjust better to that assignment than to reassignment to
the same grade. Thus, almost all therapies and administrative decisions
are, in effect, predictions, whether those who make those prescriptions
would be willing to put themselves on record as making predictions or not.

All predictions in psychology and education are what we often call 
actuarial. That is, they are made on a statistical basis and with the 
knowledge that only “in the long run” will the practice that each pre- 
diction stands for be better than otherwise. Prediction of the single case 
is recognized as being involved with many chance elements, „For the 
single case, the prediction is correct or It is incorrect, depending upon 
standards. In predicting in large numbers, there are certain Probabilities 
of being right and being wrong. The degree of rightness or wrongness 
can then be determined. Statistical methods Provide the b 
choosing what prediction to make and also a basis for knowi 
the odds are for being right or wrong. The various w 
predictions and the ways of determining their de 
treated at great length in Chs. 14 to 16. 

Test Practice and Statistics. —Because tests 
in psychology and education, considerable 
them in this volume. Recent thinking by 
educators has almost revolutionized our for 
instruments of measurement. We may expect in the next ten years con- 
siderable progress in working out the implications of these Suggestions, 
Many of the findings have been reflected in the chapters tr 
particularly Chs. 17 and 18. Certain ideas of reliabilit 
tests had become rather securely intrenched in the thought and Practice 
of test users. These ideas are reexamined and the newer experiences have 
been used to advantage in the applications of Statistics to test Practice, 

The Student’s Aims in His Study of Statistics — with this overview of 
content and with the preceding view of the needs and advantages of sta- 


parage our accomplish- 


from voca- 
oup and failure 


improve- 
t applied. 


e decisions 


asis for 


ng what 
ays of making 
Bree of accuracy wil] be 


play such an important role 
attention has been given to 
statistical Psychologists and 
mer understanding of tests as 


eating tests, 
y and validity of 


SO 


INTRODUCTION FOR STUDENTS 9 


tistics, what should the student, particularly the beginner, aim to do 
P kaa > 

about it? In the opinion of the author, the beginner’s aims may be 

listed as follows, in order to make his task more specific. 

1. To master the vocabulary of statistics.—In order to read and
understand a foreign language, there is always the necessity of building up an
adequate vocabulary. To the beginner, statistics should be regarded as a
foreign language, which he should resolve will not for long remain entirely
foreign. The vocabulary consists of concepts that are symbolized by
words and by letter symbols that are substituted for them. Along with
mathematics in general, statistics shares the ordinary symbols for
numerical operations. Thus, much of the vocabulary is already known to the
student. As for the new concepts, their meaning will continue to grow
the more the student uses them.

2. To acquire, or to revive, and to extend skill in computation.—Although
it was stated earlier that it is not an important aim for the student to
become a statistical clerk, computation is important. For many people,
the understanding of the concepts themselves comes largely through
applying them in computing operations. The mere step-by-step activities
with numbers, when certain goals are in mind, provide opportunities
for new insights to occur. The average investigator is never free from a
certain amount of computation work to be done. Computation skill
includes application of formulas as well as planning efficient
operations, and [like any skill] grows with practice. If there is
discouragement at first, further attempts should correct that.

3. To learn to interpret statistical results correctly.—Statistical results
can be useful only to the extent that they are correctly interpreted.
With full and proper interpretations extracted from data, statistical
[results are a] most powerful source of meaning and significance.
Inadequately interpreted, they may represent wasted effort. Erroneously
accounted for, they are worse than useless. It is the latter eventuality
that leads to the remark “Anything can be proved by statistics.” In the
hands of skilled operators, statistics make data “talk.” It is very
important that the implications of any statistical result be realized
and that their proper meaning be made manifest. The average reader is
less able to interpret the result than the investigator should be. Upon
his shoulders rests the responsibility of telling the reader what the
conclusions should be and to include, also, some [...] of those
conclusions.

4. To grasp the logic of statistics.—Statistics provides a way of thinking
[as well as a] vocabulary and a language. It is a logical system, like all
mathematics, peculiarly adaptable to the handling of rational
problems in science. This is hard to explain to the beginner. It is
hoped that it may become more apparent as later chapters, particularly
those dealing with sampling errors, hypotheses, predictions, and factor
analysis, are encountered. The most efficient investigator is the one who 
masters the logical aspects of his research problem before he takes recourse 
to experiment or to field study. Proper formulation of a research prob- 
lem is more than half the battle. Too many inexperienced investigators 
think of a question or a problem and rush to gather data before knowing 
what it is they really want to observe. Because it is realized that data 
of some kind must be collected, much time and effort are wasted in 
collecting the data, without thinking through the problem and coming 
to the proper decision as to just what kind of data are needed. Or, data 
are collected in such a manner that no statistical operations now known 
are adequate to treat the data so as to extract an answer. Well-planned 
investigations always include in their design clear considerations of the 
specific statistical operations to be employed.

5. To learn where to apply statistics and where not to.—While all
statistical devices have their power to illuminate data, each has its
limitations. In this respect the average student will probably suffer most from
lack of mathematical background, whether he realizes it or not. Every 
statistic is developed as a purely mathematical idea. As such, it rests 
upon certain assumptions. If those assumptions are true of the par- 
ticular data with which we have to deal, the statistic may be appropri- 
ately applied. The student should note wherever a new statistic is 
introduced that there are likely to be mentioned certain assumptions 
or properties of the situation in which that statistic may be utilized. 
Unfortunately, one can encounter masses of numbers that look as if they 
are candidates for the use of a certain statistic, e.g., a biserial coefficient 
of correlation (see Ch. 13), when actually to apply the statistic would be 
meaningless if not actually misleading. The student without mathemati- 
cal background will have to learn these exceptions by rote memory or be 
satisfied with common-sense reasons. He probably would prefer to avoid
making ridiculous applications, and when in doubt he should seek advice
or refrain from the doubtful application. 


6. To understand the underlying mathematics of statistics.—This objective
will not apply to all students. But it should apply to more than those
with unusual previous mathematical training. Many an intelligent student
who has not been introduced to analytical geometry or calculus can
nevertheless grasp many of the mathematical relationships underlying 
statistics. This will give him more than common-sense understandings 
of what goes on in the use of formulas. For the student with mathematical
background and for all others who wish to know more about the
underlying basis of statistics encountered in the following chapters, the
best single source is to be found in the book by Peters and Van Voorhis
listed below. We cannot take space to duplicate such proofs in this
volume. There are provided in the Appendix, however, a few mathematical
derivations of formulas. The selection has been [guided by]
two considerations: (1) the only mathematics required to [follow the]
proofs is that of ordinary algebra and basic calculus, and (2) the proofs
are not readily available elsewhere, either because they do not [appear]
elsewhere or because the sources are scattered.


Some Suggested Aids in Learning Statistics 

Following are a few practical suggestions to support the material in this
volume.

A Review of Arithmetic and Elementary Algebra.—Some students who have not
kept alive the skills they once acquired in arithmetic and elementary
algebra frequently feel the need of aids in reviewing those subjects, short
of the employment of tutors. To such a student it is strongly recommended
that he consult H. M. Walker’s Mathematics Essential for Elementary
Statistics, New York: Holt, 1934. This little volume provides an excellent
review in the form of selected exercises of the things that are most needed
and in which many students show forgetting. The book is especially
recommended to the student who has forgotten his high-school algebra.

Statistical Workbooks.—For the first and second semesters’ courses in which
this text is used, the student will find useful the two volumes by
J. P. Guilford and C. Lovell, Elementary Statistical Exercises, Beverly
Hills: Sheridan Supply Co., 1946, and Advanced Statistical Exercises, by
the same publisher, 1950. The first accompanies Chs. 2 through [...]; the
second covers much of the remaining material of this volume.

[...] Aids.—The wise student will make as much use as possible of all
aids in the form of calculating machines, tables, and the like. There are
inexpensive slide rules now available that will serve when three-place
accuracy [...] will take care of a large part of one’s computations.
Barlow’s Tables, New York: Spon and Chamberlain, are admirable for
supplying squares, square roots, and reciprocals for numbers from 1 to
12,500. J. W. Dunlap and A. K. Kurtz have provided many charts, tables,
and formulas in their Handbook of Statistical Nomographs, Tables, and
Formulas, Yonkers-on-Hudson, N.Y.: World, 1932. Where great accuracy in
numerical values [for the] normal curve is desired, the recommendation is
the monograph by T. L. Kelley, The Kelley Statistical Tables, New York:
Macmillan, 1938.

Other Books on Statistics.—Other statistical books in which the student or
investigator in psychology and education [may find] coordinate and
supplementary reading are as follows:

EDWARDS, A. L. Statistical analysis. New York: Rinehart, 1944.

EZEKIEL, M. Methods of correlation analysis. 2d ed. New York: Wiley, 1941.

FISHER, R. A. Statistical methods for research workers. 8th ed. Edinburgh:
Oliver & Boyd, 1941.

GARRETT, H. E. Statistics in psychology and education. 3d ed. New York:
Longmans, 1947.




GOULDEN, C. H. Methods of statistical analysis. New York: Wiley, 1939.

GUILFORD, J. P. Psychometric methods. New York: McGraw-Hill, 1936.

HOLZINGER, K. J. Statistical methods for students in education. New York:
Ginn, 1928.

KELLEY, T. L. Statistical method. New York: Macmillan, 1938.

KELLEY, T. L. Fundamentals of statistics. Cambridge: Harvard University
Press, 1947.

LINDQUIST, E. F. A first course in statistics. Boston: Houghton Mifflin, 1938.

LINDQUIST, E. F. Statistical analysis in educational research. Boston:
Houghton Mifflin, 1940.

PETERS, C. C., AND VAN VOORHIS, W. R. Statistical procedures and their
mathematical bases. New York: McGraw-Hill, 1940.

SNEDECOR, G. W. Statistical method. 3d ed. Ames, Iowa: Collegiate, 1940.

TRELOAR, A. E. Elements of statistical reasoning. New York: Wiley, 1939.

WALKER, H. M. Elementary statistical methods. New York: Holt, 1943.


CHAPTER 2 
COUNTING AND MEASURING 


Two Kinds of Numerical Data.—Numerical data generally fall into
two major kinds. Things are counted and this yields frequencies, or 
things are measured and this yields metric values, or scale values. Data 
of the first kind are often called enumeration data and data of the second
kind are called measurements or metric data.

Statistical procedures deal with both kinds of data, which is the reason
for this chapter. There are certain fundamental ideas about numbers
and their use that it is well to have in mind before we go ahead. Perhaps 
it may seem strange to the reader, who has been counting and measuring 
as long as he can remember, that we should have to devote an entire 
chapter to these topics. The experts, who, we will have to admit, have 
had a great deal more experience with numbers and their use than most 
of us have had, never cease to report new ideas and insights as to the 
properties of the number system and as to its applications. It is well to 
keep in mind, incidentally, that there is a real difference between the 
number system, as such, and its application to counting and measuring. 
Much confused thinking has resulted from ignoring this fact. The world 
does not necessarily owe its existence to number and quantity. Numbers 
were invented by man as a symbolic system of internally consistent ideas 
which he can use effectively in describing the world as he knows it, thus 
gaining control over it. 

Data and Statistics.—Before we go further, there are some frequently
used terms that should be defined. These words are statistics and data. 
The word statistics itself has several meanings. On the one hand it stands 
for a branch of mathematics which specializes in enumeration data and 
their relation to metric data. That is the meaning in the title of this book. 

Another meaning, popular but not used by technical people, is implied 
in the mother’s statement when she says, “Bobbie, stay out of the street, 
or you will become a vital statistic.” Here the term in the singular 
refers to a fact of classification, which is a chief source of all statistics. 
What the mother meant is that Bobbie would change classification from 
the category “living” to the category “dead.” The keepers of vital
statistics in the department of health and in other governmental agencies
would have one less case among the living and one more among the
not-living. This use of the term “statistics” is more common among those
agencies that keep the records. The numerical records are the statistics. 
While this use of the term is recognized by teachers and writers who 
specialize in statistics as a subject, their use of the term and the use of 
it in this book will usually mean something else. In the textbook and 
classroom situation, we are more inclined to use the word data in referring 
to details in the numerical records or reports. The fact that Bobbie is 
classified either among the living or the not-living is a datum. The word 
data always refers to more than one fact. 

In the textbook and classroom situation, too, the singular term statistic 
is most likely to mean a derived numerical value such as an average, a 
coefficient of correlation, or some other single descriptive concept. It
may refer either to the idea of an average, a median, a standard deviation,
etc., or to a particular value computed from a set of data. The reader can
usually tell from the context which usage of these terms is meant.
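
The distinction between data and a statistic can be sketched in a few lines of Python. The language and the five scores are illustrative assumptions, not part of the text; the point is only that each score is a datum, while each value computed from the whole set is a statistic.

```python
import statistics

# Hypothetical data: five test scores. Each single score is a datum.
scores = [12, 15, 15, 18, 20]

# Each derived value below is a statistic in the textbook sense:
# a single descriptive number computed from the set of data.
summary = {
    "mean": statistics.mean(scores),
    "median": statistics.median(scores),
    # pstdev treats the five scores as the whole group being described.
    "standard deviation": statistics.pstdev(scores),
}

for name, value in summary.items():
    print(name, value)
```

The same word, "mean" for example, names both the idea and the particular value obtained here, which is the double usage the paragraph above describes.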




DATA IN CATEGORIES 


Probably most social data are in the form of categorical frequencies:
the number of cases in defined classes or categories. The number of
births, marriages, and deaths constitutes the bulk of the so-called vital
statistics. The number of accidents, fatal or otherwise; the number of
arrests for different reasons; and the number of new cases of poliomyelitis
constitute other important information by which social agencies keep a
finger on the pulse of human affairs. Political and economic interests
also have their “barometers” for keeping informed of the trend of events,
though some of these depend upon measurements of variables as well as
upon counting cases.

Classification.—Before we count, in order to accumulate useful infor- 
mation, we must know what it is we count. We do not count indis- 
criminately. The frequency that we record refers to a particular class of 
objects, and this involves the process of classification. Classification of 
objects has been going on since Aristotle and even before Aristotle. It is
a basic psychological process which can be seen in rudimentary form even
in the simplest conditioned response. Wherever discriminations are made
along with generalizations, classification of a sort occurs. Useful
classifications for counting purposes, however, depend upon a high type of
logical analysis. Much of science, following Aristotle, has been of the
classificatory type. The classification of plant and animal life into species,
genus, and order is the best example. Things thus become ordered and
principles emerge.




As science progresses, it is likely to abstract variables from its data;
[...] single directions. This provides the way for
[...] measurements. In spite of this general trend in a
science, however, the classification of phenomena will probably never 
cease to be useful. Besides, there are some absolute categories that 
seem not reducible to continuous variables—life and death; married and 
unmarried; male and female; and voter and nonvoter. Such discrete 
classes must be recognized and are usefully dealt with in research as well 
as in public affairs. Classification, then, is a very useful and necessary 
process in science as well as in practical life. It is the procedure by which 
objects become categorized for counting. 

Some Psychological Categories.—Before specifying the way in which 
categories should be set up and utilized, it may be well to have in mind 
some examples of the more common kinds from the field of psychology. 
In experimental psychology, particularly in psychophysical studies, we 
have categories of judgment. The second of a pair of stimuli is judged as
“greater than,” “equal to,” or “less than” the first. In public-opinion
polling, responses are obtained in a small number of categories that are
intended to be meaningful for interpretation purposes. In answer to the
question, “Are you in favor of the Marshall plan?” the response might be
“Yes,” “No,” “I do not know what the Marshall plan is,” or “I know
what the plan is but I am undecided.” In taking a vocational-interest
test the examinee may be required to respond in one of three categories,
“L” (for like), “I” (for indifferent), or “D” (for dislike) concerning the
thing proposed. In a problem-solving experiment with rats, after some
preliminary observations, solutions might be categorized as falling into
one of four types. Clinical types in psychopathology are categories
mostly of long-standing recognition. And so one could continue. Many
categories used in research are not static; they change as new light is
thrown on the field of study. Some categories are invented for temporary
duty as provisional scaffolding upon which to arrange data for better
inspection.

There is not space here to give detailed instructions on how to choose
or to construct useful categories.¹ It may suffice to say, and it may seem
trite to do so, that categories should be well defined, mutually exclusive
(if possible), univocal, and exhaustive. The importance of good definitions
cannot be overestimated. Making proper assignment of cases to classes
depends upon it. Being understood by one’s colleagues also depends
upon it. A prime requirement of scientific findings is that they shall be
communicable to others. Other investigators should be able, if they so
desire, to repeat our operations to test our results. The requirement of
mutual exclusiveness is perhaps the most difficult to achieve. Lack of it
probably means something is missing in defining the basis of
classification. Lack of it means some overlapping, interdependence, and loss of
power to draw clear-cut conclusions. A set of unique categories means
that there is one and only one basis of classification. To group school
children into three classes, boys, girls, and Mexicans, is to inject two
principles or bases: sex difference and race difference. Perhaps anything
as grossly absurd is easily avoided; it is the more subtle confusion of
variables that causes trouble. By being exhaustive, a set of categories
provides a place for all cases. If there are only two classes, such as
delinquents and nondelinquents, and if they are well differentiated by
objective criteria, even two categories can be exhaustive. In many a
system, particularly when more than two classes are needed, there is
often a necessity for one miscellaneous group. This group is
distinguished merely on the basis of failure to place its members anywhere else.
These cases are often ignored, but if they are numerous it probably means
biased sampling in other categories. It also probably means lack of
adequacy for the classificatory system as a whole.

¹ For further details on this subject, see Peatman, J. G. Descriptive and
sampling statistics. New York: Harper, 1947. Ch. 2.
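
The requirements of mutual exclusiveness and exhaustiveness can be checked mechanically once the categories are stated as explicit rules. The following Python sketch is an illustration only: the function name check_categories, the ratings, and both classification schemes are hypothetical and are not taken from the text.

```python
from collections import Counter

def check_categories(cases, categories):
    """Return the frequencies plus the cases that break the two rules.

    A case that fits no category shows the scheme is not exhaustive;
    a case that fits more than one shows it is not mutually exclusive.
    """
    freq = Counter()
    unplaced, overlapping = [], []
    for case in cases:
        hits = [name for name, belongs in categories.items() if belongs(case)]
        if not hits:
            unplaced.append(case)
        elif len(hits) > 1:
            overlapping.append(case)
        for name in hits:
            freq[name] += 1
    return freq, unplaced, overlapping

# Hypothetical ratings on a five-point scale.
ratings = [1, 2, 2, 3, 3, 3, 4, 5]

good_scheme = {
    "low": lambda r: r <= 2,
    "middle": lambda r: r == 3,
    "high": lambda r: r >= 4,
}
bad_scheme = {
    "low": lambda r: r <= 3,    # overlaps "middle" at a rating of 3
    "middle": lambda r: r == 3,
    "high": lambda r: r >= 5,   # leaves a rating of 4 with no place
}

good = check_categories(ratings, good_scheme)
bad = check_categories(ratings, bad_scheme)
```

With the second scheme the frequencies no longer add up to the number of cases, which is the practical symptom of a faulty basis of classification.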

Qualitative and Quantitative Categories.—Most of the examples of
categories given thus far have been what we call qualitative. The classes of
objects are different in kind. There is no reason for saying that one is 
greater or less, higher or lower, better or worse than another. The basis 
is some qualitative attribute. There may be some intrinsic or some 
external basis for thinking of the classes as being ordered on a scale of 
more or less, but if so, we are unaware of it. There are, however, many 
classifications in which the groups can be ordered according to quantity 
or amount. It may be that the cases vary continuously along a
continuum that we recognize but on which we cannot yet make
measurements for lack of an instrument; we can only group in a gross manner.
Ratings on a scale of five points (and even more) may well be regarded
as such a categorizing. In such situations, the categories cannot be
defined, perhaps, in any independent terms. Each one may be
distinguishable merely by the fact that similar groups of cases are in it and
these differ notably from members of other classes.

Another instance is where the experimental controls are in graded steps.
Five groups of subjects receive different amounts of instruction of a
certain kind. In selection by means of tests, examinees are categorized into
the accepted and the rejected groups. Later, after training or service on
the job, there is a further classification between those who are satisfactory
and those who are not. Experimental and technological practice is full of
such examples. Later chapters will explain methods for dealing with them.
The very next chapter will show how metric data are most conveniently 
handled by somewhat arbitrary groupings in successive categories. 

Frequencies, Percentages, Proportions, Ratios.—A frequency has 
already been defined as the number of objects in a category. There 
are some other related concepts that, though common in advanced arith- 
metic, most students do not appreciate fully. They play an important 
role throughout this volume. We cannot review all the arithmetical 
features of these concepts here, but there are certain new uses of them 
that should be stressed and certain pitfalls to be pointed out. 

TABLE 2.1.—ELIMINATION RATES FOR BOMBARDIER STUDENTS OF THREE LEVELS OF
APTITUDE IN FOUR ARMY AIR FORCES TRAINING SCHOOLS*

                                     Aptitude level
                   Low             Moderate            High           All levels
School           N Elim.     %     N Elim.     %     N Elim.     %     N Elim.     %
A               62    26  41.9   340   105  30.9   162    29  17.9   564   160  28.4
B               69    23  33.3   274    51  18.6   125    10   8.0   468    84  17.9
C               69    20  29.0   334    43  12.9   166    15   9.0   569    78  13.7
D              139    21  15.1   274    19   6.9   149     9   6.0   562    49   8.7
All schools    339    90  26.5 1,222   218  17.8   602    63  10.5 2,163   371  17.2

N = number in training; Elim. = number eliminated; % = per cent eliminated.

* Aptitude was measured in terms of a composite score on psychological tests.
The data were selected from results during the early months of World War II.
(Adapted from unpublished data of the AAF Training Command. This will be
true of other AAF data used in this volume unless otherwise specified.)

Let us consider an example to illustrate the use of percentages. In
Table 2.1 are given some original data in the form of frequencies in 12
categories. The categories are in a two-way classification, one qualitative
and the other quantitative. The data pertain to the number of students 
in training and the number of these eliminated in each of four bombardier 
schools in the Army Air Forces during the early part of World War II. 
In each school the students had been categorized in three levels as to 
aptitude. The categorization by schools is qualitative and that by apti- 
tude is quantitative. Such a table would probably be set up to study 
the relation of elimination rate to aptitude and also to differences between
schools. We can make comparisons both ways. There will be some
comments, a little later, on how to prepare a good table. Here we are 
interested in another point: the use of percentages. 
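
As a check on the two-way classification, the marginal totals of Table 2.1 can be rebuilt from the 12 cell frequencies. The sketch below is in Python, which is our choice and not the author's; the names table, add_pairs, by_school, by_level, and grand are illustrative. The entry for school B at the moderate level is taken as 51, the reading that agrees with the printed totals.

```python
# (number in training, number eliminated) for each school and aptitude level.
table = {
    "A": {"low": (62, 26), "moderate": (340, 105), "high": (162, 29)},
    "B": {"low": (69, 23), "moderate": (274, 51), "high": (125, 10)},
    "C": {"low": (69, 20), "moderate": (334, 43), "high": (166, 15)},
    "D": {"low": (139, 21), "moderate": (274, 19), "high": (149, 9)},
}

def add_pairs(pairs):
    """Sum the in-training counts and the eliminated counts separately."""
    pairs = list(pairs)
    return sum(n for n, _ in pairs), sum(e for _, e in pairs)

# Row totals (all aptitude levels within a school).
by_school = {s: add_pairs(levels.values()) for s, levels in table.items()}

# Column totals (all schools within an aptitude level).
by_level = {
    level: add_pairs(table[s][level] for s in table)
    for level in ("low", "moderate", "high")
}

# Grand total, and the over-all elimination rate in per cent.
grand = add_pairs(by_school.values())
overall_rate = round(100 * grand[1] / grand[0], 1)
```

Summing in both directions and arriving at the same grand total is a simple safeguard against copying errors in any table of this kind.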

Percentage as a Rate Index.—If we wanted to compare schools as to
eliminations, the number eliminated in each school would be a poor index,
particularly when our comparison is made at somewhat constant levels of
aptitude. For example, at the low level of aptitude, the numbers of
eliminations were not very different: 26, 23, 20, and 21. If we gave
credence to such small differences, we should place the schools in the
rank order A, B, D, and C, from most to least eliminations. Schools A,
B, and C had comparable numbers in training, but school D had about 
twice as many. This makes us suspicious of the use of mere numbers 
eliminated as the way to compare schools. To put the schools on a fair 
basis we need to find an index of elimination rate. We should ask what 
the elimination “scores” would have been if all schools had had equal 
numbers in training. If we assume that common number in training to 
be 100, the number eliminated per hundred is a familiar percentage. The 
percentages of eliminations for students of low aptitude are 41.9, 33.3, 
29.0, and 15.1. Twenty-six is 41.9 per cent of 62; 23 is 33.3 per cent of 69; 
and so on. Now we see that there are larger differences (this is partly 
because three of the denominators, 62, 69, and 69, are less than 100) 
between schools and the rank order is now A, B, C, and D. The inversion 
of the order of C and D is decisive; at least D’s position below C now 
seems decisive. The point of this illustration is that percentages are used 
to compare groups of objects on an equitable basis. Frequencies alone 
will not do when such comparisons are to be made. 
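
The change in rank order can be reproduced with a short computation. This Python sketch uses only the low-aptitude frequencies from Table 2.1; the variable and function names are our own and purely illustrative.

```python
# Low-aptitude students: (number in training, number eliminated) by school.
low_aptitude = {"A": (62, 26), "B": (69, 23), "C": (69, 20), "D": (139, 21)}

def percent(part, total):
    """Number of cases per hundred."""
    return 100.0 * part / total

# Elimination rate for each school, reported to one decimal place.
rates = {
    school: round(percent(eliminated, in_training), 1)
    for school, (in_training, eliminated) in low_aptitude.items()
}

# Rank order by raw number eliminated, from most to least.
by_count = sorted(low_aptitude, key=lambda s: low_aptitude[s][1], reverse=True)

# Rank order by percentage eliminated, from most to least.
by_rate = sorted(rates, key=rates.get, reverse=True)

print(by_count)  # ['A', 'B', 'D', 'C']
print(by_rate)   # ['A', 'B', 'C', 'D']
```

Only the ordering by rate allows for the fact that school D had about twice as many students in training as the others.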

Some Limitations to the Use of Percentages.—Some precautions should
be pointed out concerning the use of percentages. Ideally, a percentage
of any number less than about 100 should be computed with hesitation.
If the number is less than 100, a change, by chance, of only one case
added to or removed from a category would mean a change of more than
one per cent. If we ask what per cent 15 is of 25, the answer is 60.
But if the frequency were to gain one, the percentage would be 64. If a
lower limit must be mentioned as a total below which computation of
percentages is unwise, it might be placed at 20. At this number, a change
of one case would mean a corresponding change of 5 per cent. This is
being quite liberal for the sake of applying a very useful index.
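
The effect of a single case is simply 100 divided by the total, as the following Python sketch shows. The function name one_case_shift and the particular totals tried are illustrative assumptions.

```python
def one_case_shift(total):
    """Change in the percentage when one case enters or leaves a category."""
    return 100.0 / total

# The example from the text: 15 of 25 against 16 of 25.
before = 100.0 * 15 / 25
after = 100.0 * 16 / 25

# Shift produced by one case at several totals.
shifts = {total: one_case_shift(total) for total in (20, 25, 100, 500)}

for total, shift in shifts.items():
    print(total, shift)
```

A total of 100 is the point at which one case corresponds to exactly one per cent, which is why smaller totals call for hesitation.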

In line with the discussion above, it would seem to be not very
meaningful to report percentages to any decimal places unless the total number
of cases exceeds 100. When we want a percentage for use in further
computations, however, it would be wise to retain at least one decimal
place. Frequencies are “exact” numbers (see p. 33), and percentages
based upon them are accurate to as many decimal places as we wish to use.
They thus describe the sample in terms of per hundred. It is when we
become interested in letting an obtained percentage stand for a
population value (see Ch. 9) that we must become conservative about
reporting it. In Table 2.1 all percentages were reported to one decimal place
because most of them were based upon totals greater than 100 and all
were made consistent. Consistency of this sort carries some weight, but
should not be pushed too far.

When a percentage turns out to be less than 1.0 (e.g., .2 per cent),
it is not so meaningful as larger ones, and what is worse, it may be
mistaken for a proportion (all proportions are less than, if not equal to, 1.0).
In some social statistics a series of percentages may be this small. In
this case it is common practice to change the base from 100 to 1,000 or
even more, e.g., to report 15 deaths per 100,000; 5 cases in a thousand;
and the like. As percentages these would read .015 and .5, respectively.
To avoid confusion with proportions, these should be written as 0.015 per
cent and 0.5 per cent.
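
Changing the base is only a matter of multiplying the same fraction by a different constant. A minimal Python sketch, in which the function name rate_per is our own, converts the two examples to a base of 100 and, for comparison, to a base of 1.0.

```python
import math

def rate_per(count, total, base):
    """Number of cases per `base` individuals."""
    return base * count / total

deaths_pct = rate_per(15, 100_000, 100)   # 15 per 100,000 as a per cent
cases_pct = rate_per(5, 1_000, 100)       # 5 per 1,000 as a per cent
deaths_prop = rate_per(15, 100_000, 1)    # the same death rate as a proportion

print(deaths_pct, cases_pct, deaths_prop)
```

Fifteen per 100,000 is fifteen-thousandths of one per cent, a figure small enough to show why the larger base is the more readable one.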

Proportions.—Whereas with percentages the common base is 100, with 
proportions the base, or total, is 1.0. A proportion is a part, or fraction, 
of 1.0. A proportion is 1/100 of a percentage, and a percentage is 100 
times a proportion. Careless individuals often call a percentage a pro- 
portion and vice versa. By definition, and in all strictness, they are 
different concepts. The symbol used for percentage is capital P; for
proportion the symbol is a lower-case p. This should help to fix the idea of
the relative sizes of the two. The proportion of eliminees among low- 
aptitude students at school A was .419 (see Table 2.1); for high-aptitude 
students at school B the proportion of eliminees was .080. 
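
The relation between p and P can be written out directly. In this Python sketch the function names are illustrative, and the frequencies are those of Table 2.1 for the two groups just mentioned.

```python
def proportion(part, total):
    """Part of a total, expressed on a base of 1.0."""
    return part / total

def to_percentage(p):
    """A percentage is 100 times a proportion."""
    return 100 * p

p_low_A = proportion(26, 62)     # low-aptitude students, school A
p_high_B = proportion(10, 125)   # high-aptitude students, school B

print(round(p_low_A, 3), round(to_percentage(p_low_A), 1))
print(round(p_high_B, 3), round(to_percentage(p_high_B), 1))
```

Carrying the proportion unrounded and rounding only for the report avoids compounding rounding errors in later computations.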

As compared with percentages, proportions have some advantages as 
well as disadvantages. They are less familiar to nonmathematical indi- 
viduals than are percentages. Whenever results are reported to the 
general reader, then, percentages are almost always to be preferred. Per- 
centages have another advantage in that we can speak of percentage of 
gain or of loss. Proportions are always parts of something and can never 
exceed the total, which is 1.0. They have no place in expressing gain or 
loss, though presumably losses could be expressed in terms of proportions 
if we chose, for losses cannot exceed the total; but we never use a pro- 
portion for this purpose. 

The advantages of proportions are best seen in later chapters. They 
are used more than percentages, in connection with the normal distribution
curve, in connection with item analysis of tests, and with certain
correlation methods, and so on. It has already been said that percentages
may be mistaken for proportions when they are less than 1.0.
Since proportions can never be greater than 1.0, they are much less likely 
to be mistaken for percentages. 

Probabilities.—Another advantage of proportions is their relation to
probabilities. Every probability can be expressed in the form of a pro- 
portion. We say that the probability of getting a head in tossing a coin is 
1/2 or 1 chance in 2. This is a more manageable figure if expressed as a 
probability of .5. We say that in throwing a die the probability of getting 
a six spot is 1 in 6. Expressed as a proportion this is .167. In general, 
for computation purposes, decimal fractions are much preferred to com- 
mon fractions; they are much more easily manipulated in addition and 
subtraction and in finding squares and square roots. The interchange- 
ability of proportions and probabilities will be found to be a very common 
occurrence in the later chapters. 
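As an illustration, the two probabilities just mentioned may be converted from common fractions to decimal proportions. This is only a sketch; the helper name `as_proportion` is a hypothetical one introduced here, and rounding to three places follows the .167 given above.

```python
from fractions import Fraction


def as_proportion(chances, out_of, places=3):
    """Turn 'chances in out_of' into a decimal proportion."""
    return round(float(Fraction(chances, out_of)), places)


p_head = as_proportion(1, 2)   # a head in tossing a coin
p_six = as_proportion(1, 6)    # a six spot in throwing a die

print(p_head)  # 0.5
print(p_six)   # 0.167
```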

Ratios.—A ratio is a fraction. The ratio of a to b is the fraction a/b. 
A proportion is a special ratio; the ratio of a part to a total. We may 
also have ratios of one part to another. For example, there were 69 low- 
aptitude students in training school B (Table 2.1), of whom 23 were 
eliminated and 46 were graduated. The ratio of graduates to eliminees 
was 46/23, or 2 to 1. This ratio can also be expressed as 2.0. The ratio 
of eliminees to graduates was 23/46, or .5. This could also be expressed 
as .5 to 1, but ordinarily is not. At any rate, in a ratio the base is 1.0, 
as it is in a proportion. The chief difference is that a proportion is 
restricted to the ratio of part to total, whereas ratios are not. 
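The distinction can be made concrete with the figures for low-aptitude students at school B. The sketch below uses variable names that are assumptions made for illustration; only the counts 46 and 23 come from the text.

```python
graduated = 46
eliminated = 23
total = graduated + eliminated                # 69 low-aptitude students

ratio_grad_to_elim = graduated / eliminated   # part to part
ratio_elim_to_grad = eliminated / graduated   # part to part
prop_eliminated = eliminated / total          # part to total: a proportion

print(ratio_grad_to_elim)          # 2.0
print(ratio_elim_to_grad)          # 0.5
print(round(prop_eliminated, 3))   # 0.333
```

Only the last value is a proportion in the strict sense, since only it compares a part with the total.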

Ratios are useful as index numbers. They describe rates and relationships.
The IQ is an index number of rate of general mental growth: the
ratio of mental age to chronological age (multiplied by 100). Compari- 
sons of incomes of regions are made in terms of per capita—the ratio of 
total income to population. Costs of education are more meaningful if 
stated in terms of dollars per pupil per day attended rather than in terms 
of total sums of expenditures. In dealing with index numbers one should 
keep in mind the operations by which they were derived. It sometimes 
makes a difference when they are used in computation as in averaging 
or in correlation problems (see pp. 355 and 358).

[Heading illegible].—Every student who writes a report based upon
[illegible] faces the problem of how best to organize them in tables.
[Illegible.] There are tables that list the raw or original data,
[illegible] scores in several tests earned by different individuals
[illegible] an example. Although these may be very long in some
reports, many readers like to see them presented in full so that they may
apply checks or perform other operations than the investigator used. One 
common way to present these tables is in an appendix to the report. 

A second type of table is a summarizing device. It is used to present 
an organized and curtailed picture of what is in the original data. It 
includes such descriptive statistics as means, standard deviations, and 
the like, with the data grouped in one or more meaningful ways. Table 2.1 
is an example of this type. All the essential information is there. Such a 
table should tell a complete story of its kind. It should be given a title 
that tells clearly what the table is about. If the title becomes too long it 
is better to relegate to a footnote some of the secondary information. 
Headings of columns and rows should be descriptive, and their spacing 
and the lining should show clearly to what columns or rows they belong. 
A table should be so labeled that the reader need not turn to the text 
material in order to know what is there. 

How to Prepare Tables.—The organization of such a table, in columns 
and rows, should take into consideration, first, what are the main points 
that should be brought out. In Table 2.1 probably the more important 
comparison to be made is that of the different schools. A person con- 
cerned with the administrative aspects of bombardier training certainly 
would think so. One who is concerned with the development of aptitude 
tests would, of course, be interested in the other problem, the relation of 
elimination rate to aptitude level. In the latter case, a distinction between 
schools would be of little importance. Having decided which relationship 
is of most interest, the data should be arranged so that the comparisons 
one wants to make most are easiest to observe. Here let us say we want 
to compare schools. The best basis of comparison is in terms of elimi- 
nation rates. The four elimination rates are in an uninterrupted column. 
Comparison of elimination rates for different aptitude levels is more 
difficult because other numbers intervene. 

A second consideration, and it is of less importance, is the practical one 
of keeping the dimensions of the table consistent with the dimensions of a 
page. Columns can be longer than rows; consequently, considering the 
space available for headings and the widths of numbers, we can fit the 
data in the available space. With small tables this is no problem. Ordi- 
narily, long lists go better in columns and short lists in rows. Another 
consideration is the psychological fact that horizontal eye movements are 
easier and more natural for a reader than are vertical movements. All 
these considerations must be weighed and balanced against one another. 

A third type of table is a final, summarizing one. This brings together 
the salient findings from several tables. The second type may, of course, 
serve the same function; it all depends upon the scope and nature of the
study. If there is a final-type table, however, it serves as a basis for 
major conclusions of the study. 

Graphic Representation of Data.—The graphic representation of data 
has become such an extensive art that it is possible to provide only an 
introduction to the subject here. A few fundamental principles will be 
mentioned and illustrated. A “picture may be worth ten thousand 
words” but only if it is properly done. The first requirement is that it 
shall tell a complete story for what it is intended to convey. 


[Figure 2.1: bar diagram. Vertical axis: percentage eliminated, 0 to 50.
Three groups of bars: low-aptitude, moderate-aptitude, and high-aptitude
students. Within each group the bars are shaded by school. Numbers at the
tops of the bars represent totals in training in various groups.]

Fig. 2.1.—Percentage of bombardier students eliminated from training in four different
Army Air Force schools during the early part of World War II. Comparisons are
made at three different aptitude levels.

Bar Diagrams.—Probably the most common type of figure for displaying
frequencies or percentages for categories is the bar diagram. It is
very adaptable to many purposes and arrangements. 

Figures 2.1 and 2.2 are designed to represent the data of Table 2.1. 
In these examples, the bars are in the vertical position, but bars can also
be placed in the horizontal position (Figs. 2.3 and 2.4). In Fig. 2.1 the 
data are grouped so as to show best a comparison of the different schools. 
There are three groups of bars, one for each level of aptitude of students, 
and within each group every school is represented. In each case, the
same kind of shading is used for the same school. The schools were
arranged, in general, in their order of elimination rate. They should be
in the same order in the three groups. This facilitates cross comparisons
between aptitude levels and gives an idea of trend within each group. 
Figure 2.2 was designed to emphasize comparison of elimination rates 
as dependent upon aptitude level. There are four groups, one for each 
school, with three bars in each group. Here the quantitative nature of 
the aptitude variable determines the order of the three bars in each group. 
In both diagrams, note that the numbers of students in training are 
given at the tops of the bars. The statistically minded reader will want
to know these values as a basis for judging about how reliable each percentage
is and whether differences he sees in the bars are probably genuine
or are perhaps due to chance. He cannot be sure about these questions
unless he applies some procedures described in Ch. 9, but he can get
a rough idea just by knowing the total numbers and by general past
experience.

[Figure 2.2: bar diagram. Vertical axis: percentage eliminated. Four groups
of bars: School A, School B, School C, School D. Within each group the bars
are shaded for low-aptitude, moderate-aptitude, and high-aptitude students.
Numbers at the tops of the bars represent totals in training in various
groups.]

Fig. 2.2.—Percentage of bombardier students eliminated from training at three different
levels of aptitude. Comparisons are made in four different bombardier schools of the
Army Air Forces during the early part of World War II.

Figures 2.3 and 2.4 show some data in response to the question, “How
many times did you feel afraid while flying on a combat mission?” applied
to aircrew personnel just returning from combat to redistribution stations
in the United States. The categories of responses were “Every time,
or almost every time,” “About 1/4 to 3/4 of the times,” “One to three
times,” and “Never.” This is not the place to question either the method
or the validity of the responses. We are merely illustrating a statistical
device. In Fig. 2.3 the bars are designed to compare officer with enlisted
aircrew personnel. For each category of response the bars for these two
kinds of personnel are shown juxtaposed. The numerical percentage
values are also written in so that the reader will have the more accurate
information that numbers provide if he wants it. The sizes of samples
are given below the diagram so that the reader may have some basis for
degree of confidence in the differences represented.

1 From the publication, Wickert, F. (Ed.). Psychological research on problems of
redistribution, AAF Aviation Psychology Program Research Reports, No. 14, 1947.

[Figure 2.3: horizontal bar diagram. Horizontal axis: per cent giving each
response, 0 to 50%. Response categories: every time, or almost every time;
about 1/4 to 3/4 of the times; 1 to 3 times; never. Paired bars compare
officers with enlisted men.]

Fig. 2.3.—Percentages of officer versus enlisted personnel in samples of Army Air Forces
combat returnees who responded in specified ways to a question concerning [illegible].

Figure 2.4 shows another arrangement of the same data. In this diagram
we obtain a better conception of the proportions of reactions in each
category for officers as a group and for enlisted men as a group, as well
as some possibility of comparing the two in each category because the
two bars are presented parallel and the category percentages in the same
rank order.


Pie Diagrams.—Another kind of picture that is sometimes used to show
proportions of a total is the pie diagram. The 360 degrees of a circle are
subdivided in proportion to the number or percentage in each category.
Figure 2.5 is an illustration. It shows the situation with regard to aviation
cadets in the AAF with respect to three principles of classification:
previous flying experience, marital status, and training preference. The
number in the total sample is given below each diagram. The numerical
percentage is written in each segment, which is shaded differently from
others in the same “pie.” The category name is also written in a segment
if there is room; if not, it is written just outside.

[Figure 2.4: two parallel divided bars, one for officers and one for enlisted
men (one sample size legible: N = 1,985). Legible shading categories:
frightened every time, or nearly every time; frightened 1 to 3 times; never
frightened. One further category is illegible.]

Fig. 2.4.—Percentages of responses of each type given to the question, “How many
times did you feel afraid while flying on a combat mission?” by samples of officer and
enlisted personnel who had returned from tours of combat duty in the Army Air Forces.

[Figure 2.5: three pie diagrams. Previous flying experience (N = 7,826);
marital status (N = 4,500); type of training preferred (N = 12,000).]

Fig. 2.5.—Descriptions of the status of new recruits to flying training in the Army Air
Forces during the early part of World War II with respect to previous flying experience,
marital condition, and type of training preferred.

The pie diagram is restricted to this kind of display; the proportions 
of a total. It is inferior to the bar diagram, such as that in Fig. 2.4 
(which also demonstrates proportions of wholes), when we want to com- 
pare the same categories in two samples. 
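The subdivision of the 360 degrees can be sketched as follows. The category names and counts here are hypothetical, chosen only to make the arithmetic easy to follow; they are not the data of Fig. 2.5.

```python
def pie_angles(counts):
    """Divide 360 degrees in proportion to the count in each category."""
    total = sum(counts.values())
    return {name: 360 * n / total for name, n in counts.items()}


# Hypothetical counts, not taken from the text.
counts = {"category A": 50, "category B": 30, "category C": 20}
angles = pie_angles(counts)

print(angles)
# {'category A': 180.0, 'category B': 108.0, 'category C': 72.0}
print(sum(angles.values()))  # 360.0
```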


Trend Charts.—When showing changes in frequencies, percentages, or
proportions over a period of time, a trend chart or belt graph is desirable.
One could show a bar for each sample and place the bars in time order, 
but this would not picture changing conditions nearly as well as something
continuous. Fig. 2.6 is drawn to represent such changing conditions
or trends in a certain situation. The data are in terms of percentages
of aviation students interviewed, who were subsequently recommended to 
different types of assignment. The data arose from the psychological 
unit at one classification center during World War II and cover a period 
of 15 months during the last part of 1942 and the first part of 1943. 


[Figure 2.6: belt graph. Vertical axis: percentage, 0 to 100. Horizontal axis:
successive quarters. Numbers of students in the successive quarters:
N = 1,285; 1,828; 2,152; 2,334; 3,528.]

Fig. 2.6.—Trend in the percentages of interviewed aviation students in the Army Air
Forces who were recommended for various assignments during a fifteen-month period
of World War II. (Adapted from data in the AAF Aviation Psychology Research Program,
Report No. 2, The classification program, P. H. DuBois, Ed. P. 346.)


Observations were grouped by quarters, or three-month periods. The
students interviewed were those whose classification on the basis of aptitude
scores and expressed preferences for different types of training was
not obvious under the prevailing regulations at the time.

In some trend charts the frequencies are plotted, for example, those
representing population growths or those representing changes in income.
In connection with the data of Fig. 2.6, we are not interested in numbers
but, for administrative reasons, in proportions of students disposed of in
each of four ways, for assignment to one of three types of training or to
ground duty. The reasons for any trends are, of course, not obvious from
the picture itself, but knowing the picture, a study of the situation would
probably yield an explanation of the causes and suggest, if necessary,
corrective measures.

There are other trend charts of various kinds. In a broad sense, all 
curves of learning and retention would be included. Their nature is so 
well known that they need not be described here. 

Pictographs.—The layman, who is probably not interested in statistics
or numbers, can be induced to read reports and to gain impressions the
writer wishes to make, if the picture is dressed up in terms of concrete
objects. Figure 2.7 is one example that was used to display to the average
reader the relationship that existed at the time between graduation
rate and aptitude of pilot students in the AAF. It requires a minimum
of statistical sophistication to interpret such a picture, and the cartoonish
quality of the drawings attracts attention and interest. Very effective
reporting of statistical results to the general public is done in this manner.
The number and variety of ways in which this can be applied are limited
only by the ingenuity of the reporter.

[Figure 2.7: pictograph. One row of figures for each aptitude level, with
columns for total and percentage of graduation. Overall: 5,215 cases, 3,771
graduates, 72% graduation.]

Fig. 2.7.—Percentage of pilot students at each aptitude level who graduated from
primary training in one sample of Army Air Forces trainees during World War II.
(From Aircrew selection and training, a publication of the AAF Training Command
Headquarters, 1944.)

MEASUREMENTS 


Some Examples of Psychological Measurements.—In order to make 
our discussion concrete and specific, let us consider some typical examples 
of measurements commonly made by psychologists. Perhaps the first 
examples that come to mind are scores on tests of mental ability. These 
are usually in terms of the number of correct responses to test items. A 
similar kind of measurement is seen in scores on a personality question- 
naire or a vocational-interest inventory. In these cases it is not the 
number of “correct” responses but the number of responses indicating 
the same interest or trait, often weighted in proportion to their supposed 
diagnostic value. Also in the area of mental tests we find the frequent 
reference to “chronological age,” “mental age,” and that ratio between 
the two, the “intelligence quotient.” 

In the experimental laboratory as well as in the clinic, we frequently 
measure in terms of the time required to complete a specified test or task. 
In memory experiments, we measure learning efficiency in terms of the 
number of trials to attain a certain standard of performance or in terms 
of the “goodness” of performance at the end of a certain trial or time. 
We measure efficiency of retention in terms of the time required for 
relearning (overcoming the forgetting that has taken place) and the 
efficiency of recall in terms of association time or in terms of the number 
of items correctly recited. 

In the sphere of motivation, we gauge the strength of drive in terms 
of the amount of punishment (electric shock) an organism (for example, 
a rat) will endure in order to reach his immediate goal or in terms of the 
number of times he will take a constant punishment in order to attain 
the same result. The difficulty of a task or test item can now be specified
in quantitative terms, as can the affective value (degree of liking or disliking)
for a color, a sound, or a pictorial design. In studies of sensory
and perceptual powers, the threshold stimulus and the differential limen
are given in terms of stimulus magnitudes. The span of perception or of 
apprehension is given in terms of the average number of items that the 
observer can report correctly after momentary exposures. The galvanic 
skin response, the pupillary response, and the amount of salivation also 
serve as quantitative indicators of amounts of psychological happenings. 

Some Examples of Educational Measurement.—Many an educational
problem is also a psychological problem, and its mode of measurement
has been indicated in the preceding paragraphs. Achievement in any area
of learning, like any mental ability, is measurable in terms of test scores.
Marks, however obtained, have been the traditional mode of evaluating
students in specific units of formal education. Attendance records, data 
on size of classes, on budgets, on supplies, and on other material aspects 
of the well-regulated school system compose another list of measurements 
in education. Outcomes of educational effort are often expressed quan- 
titatively in terms of promotion statistics, achievement ratios, and esti- 
mates of teaching success. Whether for purposes of research in education 
or for systematic and meaningful record keeping, statistical methods 
become indispensable tools. 

Some Different Kinds of Measurement.—In a superficial way, it is 
easy to see, as one glances over the list of psychological and educational 
measurements just mentioned, that there are different kinds of measure- 
ment involved. Among the psychologist’s measurements, some are in 
terms of the stimulus—for example, the threshold stimulus or stimulus 
difference; the number of syllables or items; the amount of electric shock, 
etc. Others are in terms of the amount of response—for example, time of 
the response; number of responses or of correct responses; degree of the 
response, etc. Some measurements are more direct, such as reaction time, 
and others more indirect, such as affective value and difficulty. Some 
measurements are in terms of discrete units—number of individuals, 
syllables, words, items, crossings—and others are in terms of continuous 
scales—age, time of response, amount of punishment, and degree of effort. 
In the discrete type of measurement, things can increase or decrease only 
by changing one whole unit at a time, whereas in the continuous type, 
the increase or decrease can be by as small a fraction of a unit as one 
pleases and can distinguish. Although this difference has a logical sig- 
nificance, in statistical practice, actually, we generally treat discrete and 
continuous measurements in the same manner. 

Rank Orders and Other Measurements.—In a most general sense, we 
make a measurement whenever we assign numbers to things in such a 
way that those things are placed in order. Suppose we place three boys, 
Charles, Bob, and David, in rank order for height, Charles being rank 3
(tallest) and David, rank 1 (shortest). The numbers 3, 2, and 1, attached 
to Charles, Bob, and David, give us some useful information, such as the 
inference that Charles is taller than Bob and that Bob is taller than David. 
These numbers do not tell us much more. Since they are merely ranks, 
we cannot say that Charles is as much taller than Bob as Bob is taller than 
David. We cannot say that Bob is two times as tall as David or that 
Charles is three times as tall as David. Measurements in terms of rank 
order simply give us the serial arrangement of things. 

As we saw from the example just given, we are not at liberty to add 
and subtract or to multiply and divide such numbers. Had we actually
applied a meter stick to these three boys and found that their heights 
were: Charles, 195 cm., Bob, 180 cm., and David, 150 cm., matters would 
be different. Now we can make some further deductions about the heights 
of these boys. We can say that the difference between Bob and David is 
two times that between Bob and Charles. Knowing that Charles is 
15 cm. taller than Bob and that Bob is 30 cm. taller than David, we can 
infer that Charles is 45 cm. taller than David. We can say that Bob is 
20 per cent taller than David and that Charles is 30 per cent taller than 
David. It is apparent that we can now perform all the arithmetical oper- 
ations of addition, subtraction, multiplication, and division with the three 
numbers assigned to the three boys. 
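The contrast between ranks and measured heights can be verified directly. The sketch below uses the three heights given above; the variable names are assumptions made for illustration.

```python
heights_cm = {"Charles": 195, "Bob": 180, "David": 150}

# Rank order for height: 1 = shortest, 3 = tallest.
ordered = sorted(heights_cm, key=heights_cm.get)
ranks = {name: i + 1 for i, name in enumerate(ordered)}
print(ranks)  # {'David': 1, 'Bob': 2, 'Charles': 3}

# Adjacent ranks always differ by 1, whatever the real differences are.
print(ranks["Charles"] - ranks["Bob"], ranks["Bob"] - ranks["David"])  # 1 1

# The measured heights show that the two differences are unequal.
gap_charles_bob = heights_cm["Charles"] - heights_cm["Bob"]   # 15 cm
gap_bob_david = heights_cm["Bob"] - heights_cm["David"]       # 30 cm
print(gap_bob_david / gap_charles_bob)                        # 2.0


def pct_taller(a, b):
    """Percentage by which a is taller than b."""
    return 100 * (heights_cm[a] - heights_cm[b]) / heights_cm[b]


print(pct_taller("Bob", "David"))      # 20.0
print(pct_taller("Charles", "David"))  # 30.0
```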

Best Measurements Require an Equal Unit and an Absolute Zero.— 
Some measurements obtained in psychology and education are compara- 
ble with the measurements of height (linear distance) just mentioned, but 
most are not. Many measurements should be regarded as merely placing 
things in rank order until it is demonstrated that they give us more accu- 
rate information than that. We have something considerably better than 
rank order when our measuring scale possesses equal units. When this is 
true, a gain of a unit in one part of the scale is equal to a gain of a unit 
in any other part of the scale. We can then perform a number of differ- 
ent operations with numbers assigned to objects on such a scale that would 
otherwise be precluded.

A measuring scale is not complete, however, unless it also has an abso- 
lute zero point. An example of a scale that has equal units but not an 
absolute zero point is the centigrade thermometer. The zero point is 
arbitrarily placed at the freezing point of water. With this instrument, 
we can say that the temperature of the weather changes as much when it 
rises from 0 to 25 as it does when it rises from 25 to 50. But we cannot 
say that 50° is twice as warm as 25° or that 100° is twice as hot as 50°.
We can find differences between numbers on this scale and get sensible
answers, but we cannot multiply and divide. If we translate our zero
mark to the absolute zero point (zero heat), which in terms of the common
thermometer is -273°, then we can perform these operations. On
the absolute scale, our 25° becomes 298°, and our 50° becomes 323°.
Now it is obvious that the higher of the two (323) is not two times the
lower (298). But if our absolute centigrade scale is correct, with regard
to equality of units, we may well say that a temperature twice as hot
physically as 298° is a temperature of 596° (also on the absolute scale).
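The effect of moving the zero point can be shown numerically. The following sketch assumes the rounded value of -273° used in the text; the function names are hypothetical.

```python
ABSOLUTE_ZERO_C = -273  # rounded value used in the text


def to_absolute(centigrade):
    """Shift a centigrade reading so that zero means zero heat."""
    return centigrade - ABSOLUTE_ZERO_C


def to_centigrade(absolute):
    """Shift an absolute reading back to the common thermometer."""
    return absolute + ABSOLUTE_ZERO_C


low, high = to_absolute(25), to_absolute(50)
print(low, high)              # 298 323
print(50 / 25)                # 2.0, a ratio with no physical meaning
print(round(high / low, 3))   # 1.084, the ratio on the absolute scale

twice_as_hot = 2 * low
print(twice_as_hot)                 # 596
print(to_centigrade(twice_as_hot))  # 323
```

On these rounded figures, the temperature twice as hot as 25° centigrade is 323° on the common thermometer, not 50°.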


Mental-test Scales as Metric Devices.—What shall we say of a
measuring scale of the type most frequently used in psychology and education:
mental-test scores in terms of number of items correct? Have
we here a scale with absolute zero and equal units? Strictly speaking, 
usually not. A score of zero, no items correctly answered, does not mean 
zero ability. For had we included some easier items, even the lowest 
individual in the test could probably have made a score numerically 
greater than zero. Thus we are unable to say that a score of 50 points 
means twice the ability represented by a score of 25 or half the ability 
represented by a score of 100 points. For if our real zero-ability score 
should have been some 25 points below our arbitrary one, these three 
scores would then become 50, 75, and 125. 

Now the second is not twice the first or half the third. Nor can we
be sure that our units are equal within the range of scores obtained.
Unless the units were equal, we should not be able to say that a score of
100 is as far above one of 75 as the latter is above a score of 50. As a
matter of long experience, however, we find that test scores generally 
behave as if units were equal; as if one item correct adds an amount to 
the measurement of ability equal to that added by any other item correct. 
There are various indications that tell the experienced worker in statistics 
when his measurements probably possess equal units and when they do 
not. And when they do, we can proceed to apply most of the ordinary 
statistical procedures. When we strongly suspect that they do not, we 
can make adjustments or substitute other statistical methods that do 
apply. The beginner in statistical work need not be too much concerned 
about trying to decide the matter, but he should be aware that there are 
natural limitations to what one may do in the way of statistics and that 
most of our ordinary conclusions are sound only in so far as equal units 
(and much less often an absolute zero point) prevail in the measuring 
scale. 
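The zero-point argument for test scores can be put in figures. This sketch takes the three scores and the supposed 25-point displacement from the text; the variable names are assumptions made for illustration.

```python
observed = [25, 50, 100]   # scores counted from the arbitrary zero
offset = 25                # supposed distance of the real zero below it

adjusted = [score + offset for score in observed]
print(adjusted)                    # [50, 75, 125]

# Ratios on the arbitrary scale suggest "twice" and "half" ...
print(observed[1] / observed[0])   # 2.0
print(observed[1] / observed[2])   # 0.5

# ... but the ratios change once the zero point is moved.
print(adjusted[1] / adjusted[0])   # 1.5
print(adjusted[1] / adjusted[2])   # 0.6

# Differences are unaffected by the location of the zero.
print(observed[2] - observed[1], adjusted[2] - adjusted[1])   # 50 50
```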

How Numbers Should Be Regarded in Measurement.—Most measure- 
ments are taken to the nearest unit—nearest foot, inch, centimeter, or 
millimeter, depending upon the fineness of the measuring instrument and 
the accuracy we demand for the purposes at hand. In giving the height 
of a tree, measurement to the nearest foot (for example, 107 ft.) would
be adequate. In giving the height of a girl, we should resort to inches or
perhaps centimeters as our practical unit. In giving the length of a 
needle, we should probably report in terms of millimeters; and in giving 
its diameter as seen under a micrometer, we should resort to some smaller 
unit. In any case, we may notice that our object does not contain an 
exact number of our chosen units. Our tree is more than 107 ft. but is 
closer to 107 than it is to 108; our girl is not exactly 156 cm. but is closer 
to 156 than to 155; etc. The result is that our report of 107 for the tree 
means anything between 106.5 and 107.5 ft., and our report for the girl
means anything between 155.5 and 156.5 cm. Figure 2.8 shows a graphic 
illustration of units and their limits. 

And so it is with most psychological and educational measurements. 
A test score of 48 is taken to mean from 47.5 to 48.5; and an obtained 
score of 70 means from 69.5 to 70.5. We assume that a score is never a 
point on the scale but occupies an interval from a half unit below to a 
half unit above the given number. We can make this seem more reason- 
able by arguing that the person making a score of 48 actually might be 
just a fraction of a unit better than 47.5 at the moment, and being better 
than 47.5 is sufficient to give him a whole score of 48. Or our individual 
might just fail to be as good as 48.5 on the same test, but, not being 
quite good enough to achieve 49 items, he falls back to 48. Although 
our tests are probably never so refined as to cause an individual to waver
between fractions of a point (the margin of error is usually more than a
whole point), this kind of argument rationalizes our procedure from one
standpoint.

[Figure 2.8: two horizontal scales. Upper scale: units 67, 68, 69, 70, 71,
with limits of units 66.5, 67.5, 68.5, 69.5, 70.5, 71.5. Lower scale: units
0.8, 0.9, 1.0, 1.1, 1.2, with limits of units 0.75, 0.85, 0.95, 1.05, 1.15,
1.25.]

Fig. 2.8.—An illustration of two metric scales, showing selected units and their limits.

A more important practical consideration dictates the taking of a score 
as occupying a whole interval on the scale, as the student will appreciate 
later. If we did not do this, an average computed from a set of ungrouped 
measurements would not be consistent with one computed when the same 
measurements are grouped. Even in dealing with discrete measurements, 
as, for example, the number of children in a family, we customarily pro- 
ceed as if 8 children meant anywhere from 7.5 to 8.5. The only notable 
exception to this general rule is in dealing with chronological age as given 
to the last birthday and the like. Then a twelve-year-old child is anywhere
from 12.0 to 13.0. If ages are given to the nearest birthday, however,
the rule again applies, and a twelve-year-old falls in the interval 11.5
to 12.5.
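The limits described in this section can be computed mechanically. The sketch below is one way of doing so; both function names are hypothetical, and a unit of 1 is assumed throughout.

```python
def limits_nearest_unit(value, unit=1.0):
    """Limits of a measurement reported to the nearest unit."""
    half = unit / 2
    return (value - half, value + half)


def limits_last_birthday(age):
    """Limits of an age reported as of the last birthday."""
    return (float(age), float(age) + 1.0)


print(limits_nearest_unit(107))   # (106.5, 107.5)  height of the tree, ft.
print(limits_nearest_unit(48))    # (47.5, 48.5)    a test score
print(limits_nearest_unit(8))     # (7.5, 8.5)      children in a family
print(limits_nearest_unit(12))    # (11.5, 12.5)    age to nearest birthday
print(limits_last_birthday(12))   # (12.0, 13.0)    age at last birthday
```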


SOME RULES REGARDING NUMBERS


Approximate and Exact Numbers.—Measurements, when taken to the 
nearest whole unit, are known as approximate numbers. They are always
“fuzzy” and are of uncertain value within the unit where they fall. When 
we find a number by enumeration of discrete objects, we have an exact 
number; for example, 15 men, 42 letters, or 50 pencils. The distinction 
between exact and approximate numbers we shall find important when 
they are used in calculations. Some rules about calculations are pre- 
sented next. They would be unnecessary if all numbers in statistics 
were exact. 

How to Round Numbers.—The beginner in statistical computation 
invariably asks, “How many decimal places shall I save?” In just this 
form, the question cannot be answered. The question should read instead 
“How much accuracy have I in the answer?” A number may have been 
rounded, dropping all digits to the right of the decimal point, yet not all 
of the remaining figures may be accurate. Another number may have 
four places remaining to the right of the decimal point, yet all of them 
may be accurate. Some students may, if they lack good rules, drop too 
many figures, thus losing much of the accuracy that they really have; 
others may save a string of figures beyond the limit of accuracy, giving 
the appearance of great exactness that is really fictitious. 

First let us be clear as to the proper way to round a number. There is 
no particular difficulty in rounding to the nearest whole number; 15.7 
becomes 16, and 27.4 becomes 27; 9.6 becomes 10, and 0.96 becomes 1.
In rounding to two decimal places, the same principles apply; 2.1827 
becomes 2.18, and 91.2179 becomes 91.22. It is when the first digit to 
be dropped is 5 that difficulties arise. In rounding to two decimal places, 
again, the number 7.1654 becomes 7.17, and even 7.16502 becomes 7.17 
rather than 7.16, for the reason that the decimal fraction beyond the 6 is 
greater than just .00500. Had the number been 7.16499, we should have 
rounded to 7.16, because it is a shade closer to 7.16 than to 7.17. 

When the number is 7.16500 (equidistant between 7.16 and 7.17) we 
follow an arbitrary rule that when the digit preceding the 5 is an even 
number we leave it as it is but when this number is odd we raise it to 
the next digit. Thus 7.16500 would be rounded to 7.16, but 7.17500 is 
rounded to 7.18. The main reason for this is that when such numbers 
are summed, in a long series, we should have had by chance as many that 
were raised a half point as were lowered the same amount, and the changes 
will tend to compensate for one another. 
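For the reader who computes by machine, the odd-even rule can be sketched in Python. This is an illustration only: the function name `round_half_even` is our own, and the numbers are passed as text so that the decimal digits are taken exactly as written.

```python
from decimal import Decimal, ROUND_HALF_EVEN


def round_half_even(numeral: str, places: int) -> Decimal:
    """Round a decimal numeral to `places` decimal places; exact ties go to the even digit."""
    quantum = Decimal(1).scaleb(-places)  # places=2 gives Decimal("0.01")
    return Decimal(numeral).quantize(quantum, rounding=ROUND_HALF_EVEN)


if __name__ == "__main__":
    for numeral in ["7.1654", "7.16502", "7.16499", "7.16500", "7.17500"]:
        print(numeral, "->", round_half_even(numeral, 2))
```

Only the two exact ties, 7.16500 and 7.17500, are decided by the odd-even rule; the other numbers simply go to the nearer value.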

A word should be added about leaving a rounded number ending in the 
digit 5. For example, the number 6.21499 rounded to three decimal places 
becomes 6.215. Were we to round this further, following our rule, we 
should have 6.22. In view of the original number, this would be incorrect. 
It would have been well to indicate when the number 6.215 was given
that the 5 came by rounding upward or that the original number was less
than 5 in the third decimal place. We can do this by writing it as 6.215−
to show this fact. The number 42.5+ has been rounded from something
greater than 42.50. Further rounding to a whole number gives 43, in 
spite of the odd-even rule offered above. 
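The hazard of rounding a number that has already been rounded can be shown with the same example. The sketch below assumes Python's `decimal` module; the helper name `to_places` is hypothetical.

```python
from decimal import Decimal, ROUND_HALF_EVEN


def to_places(value: Decimal, places: int) -> Decimal:
    """Round to the given number of decimal places by the odd-even rule."""
    return value.quantize(Decimal(1).scaleb(-places), rounding=ROUND_HALF_EVEN)


original = Decimal("6.21499")
in_two_steps = to_places(to_places(original, 3), 2)  # 6.215 first, then 6.22
in_one_step = to_places(original, 2)                 # 6.21, the correct value

print("rounded twice:", in_two_steps)
print("rounded once: ", in_one_step)
```

The two results differ because the intermediate 6.215 no longer records that the original number lay below the halfway point, which is the information the minus sign is meant to preserve.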

How Many Significant Figures in a Number?—When a measurement 
is given as 107 ft., the number is not only accurate to the nearest unit 
but is also said to be accurate to three significant figures. In spite of 
the fact that this measurement was taken only to the nearest foot, the 
7 fixes the value between 106.5 and 107.5, which makes the 7 significant. 
If we had, instead, a measurement of 107.3 ft., there would be accuracy 
to the nearest tenth of a foot and four significant figures. The .3 added 
to the number now fixes the measurement between 107.25 and 107.35 ft., 
tying the last place to the .3 ft.

The number .00156 has just three significant figures or digits. They
are the only ones that tell us about the numerical value, the two zeros 
being required merely to locate the position of the decimal point. The 
number 15600, likewise, has only three significant digits, again the two 
zeros merely being used as “fillers” to locate the decimal point. If this 
were given as the approximate cost of a certain boat in dollars, we should 
conclude that the cost was anywhere from 15550 to 15650 dollars. But if 
it had been written as 15600., with a decimal point after the last zero, this 
would indicate that measurement was to the nearest unit, or within the 
limits of 15599.5 to 15600.5 dollars. 

When zeros come between other digits, they count as significant figures. 
Thus 1002.1 has five significant figures, and .071021 also has five. Any 
other zero not used to fix the decimal point is also usually significant, 
as in .420, which has three significant digits, since the last digit fixes the 
number between .4195 and .4205. A lone zero before the decimal point, 
however, as 0.41, is not significant, since it adds nothing to our information
concerning numerical value.
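The conventions just described can be collected into a short counting routine. The following is a sketch under stated assumptions: the number is supplied as text exactly as written, the function name `count_significant_figures` is our own, and trailing zeros of a whole number are treated as fillers unless a decimal point follows them, as in the text.

```python
def count_significant_figures(numeral: str) -> int:
    """Count significant figures in a numeral written as text, e.g. '15600.'."""
    text = numeral.replace(",", "").lstrip("+-")
    has_point = "." in text
    whole, _, fraction = text.partition(".")
    # Leading zeros only locate the decimal point, so they are dropped.
    digits = (whole + fraction).lstrip("0")
    if not has_point:
        # Without a decimal point, trailing zeros are taken to be fillers.
        digits = digits.rstrip("0")
    return len(digits)


if __name__ == "__main__":
    for numeral in ["107", "107.3", ".00156", "15600", "15600.", "1002.1", ".420", "0.41"]:
        print(numeral, "has", count_significant_figures(numeral), "significant figures")
```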

Rules Governing Significant Figures in Computation.—The following
rules will determine how many significant figures there are in a number
found by computation.

1. In Sums of Numbers. Case I.—When all the numbers added are
regarded as accurate to the nearest unit, the sum is regarded as accurate
to the nearest unit.


Example: 47 + 161 + 5,171 = 5,379, a sum that is accurate to the nearest unit and
that has four significant figures. 


A similar case occurs when all the numbers added have the same 
number of decimal places. 




Example: 2.91 + 40.22 + 0.07 = 43.20, where the answer is accurate to the second
decimal place because all the numbers were accurate to that place. 


Case II.—When numbers that are not accurate to the same number
of places at the right of the decimal point are added, the sum is accurate 
only as far as the number having the smallest number of decimal places. 


Example: 17.257 + 142.1 + 75.47 = 234.8, which is rounded from 234.827. Note
that the rounding was done after summing and not before. 


A similar rule is true when numbers rounded to the left of the decimal
point are summed. 


Example: 75,000 + 3,845 = 79,000, which is rounded from 78,845 because in the 
first number there are only two significant digits to the left of the hundreds place. 


2. In Differences. Case I.—If the two numbers are accurate to the
same digit at the right, the difference is also accurate that far to the right. 


Example: 173.24 − 98.84 = 74.40, the zero being significant.


Frequently a difference is drastically reduced in the number of signifi- 
cant figures, so much so that further computations with this difference 
are sometimes lacking in desired accuracy. This situation is to be avoided
when possible. 


Example: 4.692 − 4.685 = 0.007.


Case II.—As with addition, the answer is accurate no further to the
right than is the number whose accuracy extends less far to the right.
In the following examples, the answers are rounded to as many significant
figures as are accurate.


Example: 175.1 − 82.715 = 92.4 (not 92.385).
Example: 5,200 − 829 = 4,400 (not 4,371).


In both these cases, contrary to the practice in summing numbers, the 
rounding can just as well be done before subtracting, for the result will 
be the same either way. 
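The rules for sums and differences can be sketched in one routine, since a difference is a sum with a negative term. The function name `add_approximate` is hypothetical. As an assumption of the sketch, a number whose trailing zeros are mere fillers is written in exponent form, so that 75,000 accurate to the nearest thousand is entered as "75E3" and 5,200 accurate to the nearest hundred as "52E2".

```python
from decimal import Decimal, ROUND_HALF_EVEN


def add_approximate(*numerals: str) -> Decimal:
    """Add approximate numbers, then round the total to the least precise place."""
    terms = [Decimal(n) for n in numerals]
    # The exponent marks the last accurate place: -3 for 17.257, -1 for 142.1.
    coarsest = max(t.as_tuple().exponent for t in terms)
    total = sum(terms, Decimal(0))  # rounding is withheld until after summing
    return total.quantize(Decimal(1).scaleb(coarsest), rounding=ROUND_HALF_EVEN)


if __name__ == "__main__":
    print(add_approximate("17.257", "142.1", "75.47"))  # sum, Case II
    print(add_approximate("2.91", "40.22", "0.07"))     # sum, Case I
    print(add_approximate("175.1", "-82.715"))          # difference, Case II
    print(int(add_approximate("75E3", "3845")))         # rounded to thousands
```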


3. In Products of Numbers. Case I.—The product of two approximate
numbers has no more accurate significant digits than has the number with
the smaller number of significant digits.


Example: 41.57 × 1.3 = 54 (not 54.041).


Case II.—The product of an exact number times an approximate number
has no more accurate significant figures than has the approximate
number.


Example: 24.091 × 22 = 530.00 (where 22 is an exact number).
Example: 24.09 × 72 = 1,734 (where 72 is an exact number).


Case III.—The product of two exact numbers is accurate to all obtained 
digits. 


Example: 175 × 42 = 7,350 (which may be written as 7,350.).


4. In Quotients. Case I.—The quotient of two approximate numbers
has no more accurate significant digits than the one having the smaller
number of significant digits.

Example: 7.182 ÷ 2.3 = 3.1 (not 3.12261).

Example: 4.07 ÷ 0.2815 = 14.5 (not 14.458).

Case II.—The quotient from an exact and an approximate number
contains no more accurate significant numbers than the approximate
number.


Example: 7.1025 ÷ 22 = 0.32284 (where 22 is an exact number).


Case III.—The quotient of two exact numbers may be written to as
many significant figures as one wishes.
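For products and quotients the rule fixes a number of significant figures rather than a decimal place. A minimal sketch follows, assuming the computer supplies the permitted number of figures; the name `round_to_sig_figs` is our own.

```python
from decimal import Decimal, ROUND_HALF_EVEN


def round_to_sig_figs(value: Decimal, figures: int) -> Decimal:
    """Round `value` so that it keeps exactly `figures` significant figures."""
    if value == 0:
        return value
    # adjusted() is the power of ten of the leading digit: 1 for 54.041.
    shift = value.adjusted() - figures + 1
    return value.quantize(Decimal(1).scaleb(shift), rounding=ROUND_HALF_EVEN)


if __name__ == "__main__":
    # Two approximate numbers: keep as many figures as the poorer one has (two).
    print(round_to_sig_figs(Decimal("41.57") * Decimal("1.3"), 2))
    # Exact 22 times approximate 24.091: keep the five figures of 24.091.
    print(round_to_sig_figs(Decimal("24.091") * 22, 5))
    # Quotient of two approximate numbers: three figures, as in 4.07.
    print(round_to_sig_figs(Decimal("4.07") / Decimal("0.2815"), 3))
```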


5. In Squaring a Number.—Since this is a matter of multiplying a
number by itself, the same rules as those governing products will apply.
In general, the square of an approximate number contains no more accurate
significant figures than the number itself.

6. In Square Roots of Numbers. Case I.—The square root of an
approximate number contains roughly the same number of significant
figures as the number itself. The square root of 85.7, for example, may
be taken appropriately to be 9.26, to three significant figures.

Case II.—The square root of an exact number may be given to as
many places as one wishes.


Example: √5 = 2.2361. This could be carried further, or we could round it to
2.236 or to 2.24, depending upon our purposes.


In many statistical problems which the student will encounter, the
square root of a number of persons or observations will be utilized (see
Ch. 9 particularly). The number of discrete objects is an exact number;
thus the square root can be carried as far as one wishes. The rule
to follow is to think how many significant digits are needed for the
purpose at hand. As a general suggestion, one might use not less than [...]
significant digits in such a square root.

Application of the Rules.—Although the rules as just given are reasonable
and sound, one should use them as guides and not follow them slavishly.
One frequently has to use his best judgment and do the most
reasonable thing. To follow the rules rigidly at every step of the way
would sometimes introduce inaccuracies or else cause one to lose information
that he really has and needs. One good general principle to follow
is to carry along more significant figures through the successive steps of calculation
than would be required for strict accuracy under the rules and withhold
the rounding of numbers until the final answer is obtained, such as an
arithmetic mean, a standard deviation, or a correlation coefficient. At
the end of a solution, one may decide upon the extent of accuracy in the 
answer by applying the rules to every step in the series of numerical oper- 
ations. This is difficult in some problems because of the many steps. 
There are also other things to be considered in particular situations, such 
as the standard error (see Ch. 9) of the statistic computed. For these 
reasons further suggestions will be offered more appropriately later when 
we are dealing with specific cases. 
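The advice to withhold rounding until the final answer can be illustrated with a small two-step calculation. The figures below (47 divided by 7, then multiplied by 100) are hypothetical and chosen only to make the effect visible; the helper name `rounded` is our own.

```python
from decimal import Decimal, ROUND_HALF_EVEN


def rounded(value: Decimal, quantum: str) -> Decimal:
    """Round `value` to the place named by `quantum`, e.g. '0.1' or '1'."""
    return value.quantize(Decimal(quantum), rounding=ROUND_HALF_EVEN)


quotient = Decimal(47) / Decimal(7)  # 6.714285..., carried to many figures

# Rounding at the intermediate step and again at the end.
early = rounded(rounded(quotient, "0.1") * 100, "1")
# Carrying the extra figures along and rounding only the final answer.
late = rounded(quotient * 100, "1")

print("rounded at each step:   ", early)
print("rounded only at the end:", late)
```

The two answers differ by a whole unit, although each individual rounding was done correctly.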

The student will now see the reason for the earlier statement (p. 33) 
to the effect that the question “How many decimal places shall I save?” 
cannot be answered very simply. The most important things to carry 
away from the discussion above are a better appreciation of the problems 
of accuracy and, roughly, some of the limitations to accuracy of figures
derived from measurements.


Exercises


1. In a certain school in a southwestern city, the fifth grade had 80 pupils, of whom
32 were of white, American-born stock, 20 were of Mexican, 10 of Japanese, and 18 of
American-Indian stock. Complete the following table:


Stock           | Frequency | Percentage | Proportion
American white  |           |            |
Mexican         |           |            |
Japanese        |           |            | .125
American-Indian |           |            |


2. In the preceding data, what was the ratio of Mexicans to Indians? Of American
white to Japanese? Of Indian to American white?
3. In selecting a child at random from the fifth-grade group, what is the probability
of getting a Mexican? Of getting a Japanese? An Indian? Either a Mexican or an
Indian?
4. In the fourth grade of the same school, the following numbers of children
appeared: American white, 47; Mexican, 27; Japanese, 11; Indian, 15. In the third
grade the numbers were: 66, 30, 6, and 18, respectively. Prepare a tabulation of the
data in the three grades. Draw conclusions from the table.


5. Draw bar diagrams representing the racial data given above.
6. Draw a trend chart representing the same data.
7. State the exact limits to the following scores or measurements: 57 sec.   [...] kg.
65 score points   0 score points   14.5 cm.   .125 sec.   15 years (to the
last birthday).
8. Round the following numbers to one decimal place: 26.418   4.072   4.98
9.092   120.052   0.3500   44.7508   291.6500   8.8502   31.15−
9. How many significant figures in each of the following numbers: 1,942   20,
170.9   0.31   28,000   0.0017   0.3400   21,5000.
10. Write the answers to the following problems to as many significant figures as the
rules concerning accuracy allow:
a. 21.3 in. times 15 (where 15 is an exact number).
b. 5.2 + 17.2509 + 918.04.
c. 242.8 × 0.075.
d. 4.27505 divided by 25 (where 25 is an exact number).
e. 17.98 divided by 2.1.
f. 38.6 squared.
g. √50 (where 50 is an exact number, but be reasonable).
h. 25.3179 2 1 


CHAPTER 3 
FREQUENCY DISTRIBUTIONS 


After we obtain a set of measurements, the next customary step is to
put them in systematic order by grouping them in classes. A set of individual
measurements, taken as they come, as in the list in Table 3.1, does
not convey much useful information to us. We have merely a vague,
general conception of about how large they run numerically but that is
about all. The data in Table 3.1 are scores made by 50 students in an
ink-blot test. Each score is the number of objects the student reported
in observing 10 ink blots during a period of 10 min. Concerning such a
set of data we usually want to know several things. One is what kind of
score the average or typical student makes; another concerns the amount
of variability there is in the group or how large the individual differences
are; and a third is something about the shape of the distribution of scores,
i.e., whether the students tend to bunch up at either end of the range or
at the middle or whether they are about equally scattered over the entire
range. The first steps in the direction of answering these questions require
the setting up of a frequency distribution.


TABLE 3.1.—SCORES IN AN INK-BLOT TEST


25 33 35 37 55 27 40 33 39 28
34 29 44 36 22 51 29 21 28 29
33 42 15 36 41 20 25 38 47 32
15 27 27 33 46 10 16 34 18 14
46 21 19 26 19 17 24 21 27 16
THE CLASS INTERVAL—ITS LIMITS AND FREQUENCIES

The Size of Class Interval.—We could begin by asking how many
scores of 25 there are, of 26, 27, etc., but this would not give us an adequate
picture, because in a group of only 50 individuals whose scores
range from 10 to 55, many scores do not occur at all and others occur
only once. We therefore combine the scores into a relatively small number
of class intervals, each class interval covering the same range of score
units on the scale of measurement.


The first thing to be decided is the size of the class interval. How 
many units shall it contain? This choice is dictated by two general 
customs to which experience has led us to agree. One is the rule that 
we should have not less than 10 nor more than 20 class intervals. Though 
in rare instances we find workers going outside those limits, the general 
tendency is for them to keep within the boundaries of 10 to 15. The 
small number of groups is favored by the fact that we often deal with 
small numbers of individuals in our measured sample and by the urge for 
[...]. The larger number is favored by the desire for accuracy of
computation, because the process of grouping will introduce minor errors
into the calculations, and the coarser the grouping, that is, the smaller 
the number of classes, the greater is this tendency. 

Some Sizes Preferred.—The second rule determining the choice of class
interval is that certain ranges of units (scores) are preferred. They are
1, 2, 3, 5, 10, and 20. These six intervals will be found to take care of
almost all sets of data. To apply these rules to our data in Table 3.1,
we need first to know the total range of scores from highest to lowest.
The highest score is 55, and the lowest is 10, which gives us a total range
of 46 points (one more than the highest minus the lowest). An interval of
3 points is the one that would give us the best number of classes that our
first rule requires. It will be found that the range divided by the number
of units in the class interval (in this case 46 divided by 3) ordinarily gives
the total number of class intervals needed to cover the range. In this
instance, the quotient is 15.3, and since a fraction of an interval requires
a whole one, we should have 16 groups. If we chose 5 units as our
class interval, we should have 46/5, or 9.2, which is 10 groups. In view of the
small number of cases, and because an interval of 5 will give us
10 groups, we choose 5 as our class interval.¹
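The arithmetic of this choice can be laid out for all six preferred sizes at once. The sketch below uses the scores of Table 3.1; the names `SCORES` and `number_of_classes` are our own, and the count obtained is only the ordinary one, since the point at which the intervals are started may add a class.

```python
import math

# The 50 ink-blot scores of Table 3.1.
SCORES = [
    25, 33, 35, 37, 55, 27, 40, 33, 39, 28,
    34, 29, 44, 36, 22, 51, 29, 21, 28, 29,
    33, 42, 15, 36, 41, 20, 25, 38, 47, 32,
    15, 27, 27, 33, 46, 10, 16, 34, 18, 14,
    46, 21, 19, 26, 19, 17, 24, 21, 27, 16,
]


def number_of_classes(scores, size):
    """Range divided by interval size, a fraction counting as a whole class."""
    total_range = max(scores) - min(scores) + 1  # one more than highest minus lowest
    return math.ceil(total_range / size)


if __name__ == "__main__":
    for size in (1, 2, 3, 5, 10, 20):
        k = number_of_classes(SCORES, size)
        verdict = "within 10 to 20" if 10 <= k <= 20 else "outside 10 to 20"
        print(f"interval of {size:2d}: {k:2d} classes ({verdict})")
```

Only the intervals of 3 and 5 units satisfy the first rule for these data, which is why the choice in the text lies between them.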

Where to Start the Class Intervals.—It would be quite natural
to start the intervals with their lowest scores at multiples
of the size of the interval; when the interval is 3, to start them with 9,
12, 15, 18, etc.; when the interval is 5, to start with 10, 15, 20, 25, 30, etc.
This is by far the most common practice, though it is admittedly arbitrary.
When the size of the interval is 3 or 5, there are arguments for
starting intervals in such a way that the multiple of the size of interval is
at the midpoint of the group. Thus the grouping by three’s would
give groups of 8, 9, 10, and 14, 15, 16, etc.; by five’s, it would be 8, 9,
10, 11, 12 and 18, 19, 20, 21, 22, etc. The midpoints would be multiples
of 3 in the one case and of 5 in the second case. We use score limits so
much more than we do midpoints, however, that the arguments seem
mostly to favor beginning intervals consistently with the multiples of the
size of interval, even when the size is 3 or 5 units.


¹ While the rules as just stated will be satisfactory for most purposes, some variations
will be presented later in connection with [...] for graphic representation of distributions
and for estimating [...].


Score Limits of Class Intervals.—We shall follow the usual practice here,
placing in the lowest interval all scores of 10, 11, 12, 13, and 14; in the
next higher interval, scores of 15, 16, 17, 18, and 19; etc. (see Table 3.2).
Instead of writing out all the scores for each interval, we give only the
bottom and top scores. Our intervals are then labeled 10 to 14, 15 to 19,
20 to 24, etc., or, more often, 10-14, 15-19, 20-24. The bottom and top
scores for each interval represent what we call the score limits of the
interval. They do not indicate exactly where each interval begins and
ends on the scale of measurement. The score limits are useful primarily
in tallying and in labeling the intervals.

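TABLE 3.2.—FREQUENCY DISTRIBUTION OF THE INK-BLOT SCORES THAT WERE LISTED
IN TABLE 3.1


  (1)      | (2)            | (3)
  Scores   | Tallies        | Frequencies, f
  55-59    | /              |  1
  50-54    | /              |  1
  45-49    | ///            |  3
  40-44    | ////           |  4
  35-39    | ///// /        |  6
  30-34    | ///// //       |  7
  25-29    | ///// ///// // | 12
  20-24    | ///// /        |  6
  15-19    | ///// ///      |  8
  10-14    | //             |  2
           |                | Σf = 50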

Exact Limits of Class Intervals.—We shall soon find that in computations
we must think in terms of exact limits. Remember that a score of
10 actually means from 9.5 to 10.5, and that a score of 14 actually means
from 13.5 to 14.5. This means that the interval containing scores 10 to
14 inclusive actually extends from 9.5 to 14.5 on the measurement scale.
Likewise, the interval having score limits of 15 and 19 has exact limits
of 14.5 and 19.5 on the scale. The interval labeled 55 to 59 actually
extends from 54.5 to 59.5. The same principle holds no matter what
the size of interval or where it begins. An interval labeled 14 to 16
includes scores 14, 15, and 16 and extends exactly from 13.5 to 16.5. An
interval labeled 70 to 79 extends from 69.5 to 79.5. It will be seen that
by following this principle each interval begins exactly where the one
below leaves off, which is as it should be (see Fig. 3.1).¹
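The passage from score limits to exact limits is a matter of half a unit at each end, and may be sketched as follows. The function name `exact_limits` and the `unit` argument are our own; the argument is included only on the assumption that measurements might be taken in units other than 1.

```python
def exact_limits(low_score, high_score, unit=1):
    """Exact limits of a class interval, given its score limits."""
    half = unit / 2  # a score stands for half a unit on either side of itself
    return (low_score - half, high_score + half)


if __name__ == "__main__":
    for low, high in [(10, 14), (15, 19), (55, 59), (14, 16), (70, 79)]:
        lower, upper = exact_limits(low, high)
        print(f"{low}-{high} extends from {lower} to {upper}")
```

The upper exact limit of each interval coincides with the lower exact limit of the interval above it, so that no gap is left on the scale.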


Fig. 3.1.—Exact limits of class intervals with different sizes of interval and of unit of
measurement.


Tallying the Frequencies.—Having decided upon the size of class interval
and with what scores to start the intervals, we are ready to list them,
as in Table 3.2. It is accepted custom to place the highest measurements
at the top of the list and the lowest at the bottom, as shown here.
Space is left in the second column for the tallying process. Taking each
score in Table 3.1 as we come to it, we locate it within its proper interval
and write a tally mark in the row for that interval. Having completed
the tallying, we count up the number of tally marks in each row to find the
frequency (f), or total number of individuals falling within each group.
The frequencies are listed in the third column of Table 3.2.
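The tallying just described can be imitated by machine. The following is a sketch only, using the scores of Table 3.1 with intervals of 5 units beginning at 10; the names `SCORES` and `tally` are hypothetical.

```python
from collections import Counter

# The 50 ink-blot scores of Table 3.1.
SCORES = [
    25, 33, 35, 37, 55, 27, 40, 33, 39, 28,
    34, 29, 44, 36, 22, 51, 29, 21, 28, 29,
    33, 42, 15, 36, 41, 20, 25, 38, 47, 32,
    15, 27, 27, 33, 46, 10, 16, 34, 18, 14,
    46, 21, 19, 26, 19, 17, 24, 21, 27, 16,
]


def tally(scores, start=10, size=5):
    """Return (low score, high score, frequency) rows, highest interval first."""
    counts = Counter((s - start) // size for s in scores)  # interval index of each score
    rows = []
    for k in range(max(counts), -1, -1):
        low = start + k * size
        rows.append((low, low + size - 1, counts[k]))
    return rows


if __name__ == "__main__":
    for low, high, f in tally(SCORES):
        print(f"{low}-{high}  {'/' * f:<12}  {f}")
    print("sum of frequencies:", sum(f for _, _, f in tally(SCORES)))
```

The sum of the frequencies printed at the end should equal the number of scores tallied, here 50.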

Checking the Tallying.—Next we sum the frequencies, and if our tallying
has omitted none and duplicated none, the sum should equal the number
of individuals. At the bottom of the column we find the symbol Σf, in
which Σ (capital Greek sigma) stands for “the sum of” whatever follows
it. Thus, Σf is “the sum of the frequencies.” The total number
of individuals or measurements in our sample is symbolized by the capital
letter N, which stands for “number.” If Σf does not equal N, there has
been a mistake in tallying, and tallying should be repeated until this
check is satisfied. Even if Σf does equal N, there could have been a
tally or two placed in the wrong interval. There is no way of checking
this kind of error except by doing the tallying twice. The moral is that
great care should be taken to make the finding of frequencies correct at
the first attempt.


¹ Strictly speaking, limits such as 69.5 and 79.5 also stand for very small distances
rather than points. Only in a relative sense are they division points between intervals.
Some writers define an interval such as the one containing scores from 70 to 79 as being
actually from 69.5000 to 79.4999. One could extend the zeros and nines indefinitely.
For practical purposes the “exact” limits of 69.5 and 79.5 will serve very well when
measurements are integers.


GRAPHIC REPRESENTATION OF FREQUENCY DISTRIBUTIONS


The frequency distribution in Table 3.2, particularly the array of tally
marks, gives us a general picture of the group of individuals as a whole.
We can see, for example, that the most frequent scores fell in the interval
25-29, that the very low and very high scores are more rare, and that
the greatest bunching of scores comes in the lower half of the range.
Much better pictures of this distribution are afforded in Figs. 3.2 and 3.3,
however, where the general contour of the distribution is more accurately
represented and the numbers of cases in the various intervals are more
exactly shown. Fig. 3.2 is of the type known as frequency polygon, and
Fig. 3.3 is of the type called histogram, or sometimes, though less often,
column diagram.


Fig. 3.2.—A frequency polygon for the distribution of scores in the ink-blot test.
(Base line: Scores.)


Fig. 3.3.—A histogram for the same distribution as in Fig. 3.2. (Base line: Scores;
vertical scale: Frequencies.)


The Frequency Polygon and How to Plot It.—A polygon is a many- 
sided figure, and thus the picture in Fig. 3.2 derives its name. There 
are a number of factors to be kept in mind in drawing such a figure.

The Kind of Graph Paper.—First, it might be said that, in general, the
most convenient type of cross-section paper is the type that is ruled into 
heavy lines 1 in. apart each way, subdivided into tenths of an inch more 
lightly drawn. 

The Width of the Diagram.—Second, the question of the height and
width of the entire figure arises. For the sake of easy readability, the
width of the figure should be at least 5 in. We have altogether 10 class
intervals in which there are frequencies, but in drawing the diagram, we
should allow for one more class interval at each end of the scale, making
12 in all. This is to permit bringing the ends of the polygon down to
the base line (see Fig. 3.2).

Labeling the Base Line.—In deciding how many intervals to allow to the
inch, it is well to remember that we are going to label the base line of
the figure in terms of our measuring scale and hence should plan things
so that 1/10 in. will stand for an integral number of units on this original
scale. In the ink-blot data, we have been dealing with a class interval of
5 units, and we are making room for 12 intervals on our base line, in
other words, for 60 units. By allowing 1/10 in. to each unit (that is, 5/10 in. to
each class interval), our distribution will spread over an extent of 6 in.,
which is sufficiently large. On the base line, therefore, we label every
fifth line with a multiple of 5, beginning with 5 at the left and ending
with 65 at the right.

The Height of the Figure.—The third important question is with regard
to the relative height of the figure. For the sake of appearance and also
for easy reading of the diagram, there is a general custom of making the
maximum height of the distribution from 60 to 75 per cent of the total
width. Our total width is 6 in., or 60/10 in. Sixty per cent of this would
be 36/10 in., and 75 per cent would be 45/10 in. Our highest frequency,
as we see in Table 3.2, is 12. By allowing 3/10 in. to the person, the
height of 36/10 in. would be attained, and by allowing 4/10 in. to a person
a height of 48/10 in. would be reached. The former comes within our
rule, and the latter does not; therefore we adopt 3/10 in. as the unit on
the vertical scale.
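The planning of width and height reduces to a few lines of arithmetic, sketched here with exact fractions of an inch. The function name `plan_figure` is our own, and the search over vertical units in whole tenths of an inch is an assumption suited to the graph paper described earlier.

```python
from fractions import Fraction


def plan_figure(n_intervals, interval_size, max_frequency, inch_per_unit=Fraction(1, 10)):
    """Return the width, the allowed range of height, and the usable vertical units."""
    width = n_intervals * interval_size * inch_per_unit
    lowest, highest = Fraction(60, 100) * width, Fraction(75, 100) * width
    # Try 1/10, 2/10, ... 10/10 in. per person and keep those within the custom.
    usable = [
        Fraction(t, 10)
        for t in range(1, 11)
        if lowest <= max_frequency * Fraction(t, 10) <= highest
    ]
    return width, (lowest, highest), usable


if __name__ == "__main__":
    width, (lowest, highest), usable = plan_figure(12, 5, 12)
    print("width in inches:", float(width))
    print("height between", float(lowest), "and", float(highest), "inches")
    print("usable vertical units (in. per person):", [str(u) for u in usable])
```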

How to Locate a Midpoint.—In order to plot a dot to represent the frequency
in each class interval, we must next decide above what point on
the base line the dot shall be. It is plotted exactly at the midpoint of
the interval, and the midpoint is exactly midway between the exact lower
and upper limits of the interval. A simple rule to find the midpoint is to
average either exact or score limits of the interval. The interval containing
scores 10 to 14 inclusive has exact limits of 9.5 and 14.5. The
entire range is 5 units. Half this range is 2.5 units. Go this far above
the lower limit, and you have 9.5 plus 2.5, or 12 exactly, as the midpoint.
This could be written as 12.0. Or deduct 2.5 from the upper limit,
14.5 minus 2.5, and you also have exactly 12.0 as the midpoint; or the
average of 10 and 14 is 12.0. The midpoint of the interval 55-59 is 57.0.
When the class interval is 5 and the lowest score in each interval is a
multiple of 5, as will be true in many of the instances met in psychology
and education, the midpoints will end in 2 and 7 systematically. For the
sake of a complete picture of the midpoints for the data in Table 3.2, we
have given in Table 3.3 the full set of midpoints. For a general illustration
of midpoints, see Fig. 3.4.


Fig. 3.4.—Midpoints of class intervals with differing numbers of units.
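The three ways of finding a midpoint described above must agree, and a short sketch can confirm this for integer scores. The function name `midpoint` is hypothetical.

```python
def midpoint(low_score, high_score):
    """Midpoint of a class interval of integer scores, found three ways."""
    lower_exact, upper_exact = low_score - 0.5, high_score + 0.5
    half_range = (upper_exact - lower_exact) / 2
    from_below = lower_exact + half_range      # 9.5 plus 2.5
    from_above = upper_exact - half_range      # 14.5 minus 2.5
    from_scores = (low_score + high_score) / 2  # average of the score limits
    assert from_below == from_above == from_scores
    return from_scores


if __name__ == "__main__":
    for low in range(5, 65, 5):
        print(f"{low}-{low + 4}: midpoint {midpoint(low, low + 4)}")
```

With intervals of 5 units starting at multiples of 5, every midpoint printed ends in 2 or 7, as stated in the text.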

Plotting the Points.—Having determined the midpoints and knowing
the frequencies corresponding to them, we are ready to plot the dots for
the frequency polygon. For the two intervals at the ends of the distribution
(see Table 3.3) we have frequencies of zero. Sometimes there are
frequencies of zero not in the last two classes. When so, we plot these
dots also on the base line and bring the lines that connect the dots down
to the base line at those places. That did not happen to be the case in
these data. When the dots are placed at the midpoints, as directed, it
may be noted that they do not appear directly above the points midway
between the marked places on the base line (5, 10, 15, 20, etc., in this case).
Remember that these multiples of 5 are not the exact limits of the class intervals;
they are merely convenient and meaningful reference points on our original
scale. Had we begun the class intervals at scores other than multiples
of 5, for example, at 11, 16, 21, 26, etc., we should still plot at the midpoints
of the intervals (now different from before) and should still label
the reference points as multiples of 5, as in Fig. 3.2. The curve as drawn
truly represents the shape of the distribution as we have grouped the
scores.


TABLE 3.3.—CLASS INTERVALS AND THEIR MIDPOINTS


Score limits | Exact limits | Midpoints | Frequencies
60-64        | 59.5-64.5    | 62        |  0
55-59        | 54.5-59.5    | 57        |  1
50-54        | 49.5-54.5    | 52        |  1
45-49        | 44.5-49.5    | 47        |  3
40-44        | 39.5-44.5    | 42        |  4
35-39        | 34.5-39.5    | 37        |  6
30-34        | 29.5-34.5    | 32        |  7
25-29        | 24.5-29.5    | 27        | 12
20-24        | 19.5-24.5    | 22        |  6
15-19        | 14.5-19.5    | 17        |  8
10-14        | 9.5-14.5     | 12        |  2
5-9          | 4.5-9.5      | 7         |  0
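The dots of the frequency polygon, including the two on the base line at the ends, can be listed before any drawing is done. The sketch below assumes intervals of equal size with frequencies given from the lowest interval upward; the name `polygon_points` is our own.

```python
def polygon_points(lowest_score, size, frequencies):
    """Return (midpoint, frequency) pairs, with a zero class added at each end."""
    padded = [0] + list(frequencies) + [0]
    # Midpoint of the added interval just below the lowest occupied one.
    first_mid = (lowest_score - size) + (size - 1) / 2
    return [(first_mid + i * size, f) for i, f in enumerate(padded)]


if __name__ == "__main__":
    ink_blot = [2, 8, 6, 12, 7, 6, 4, 3, 1, 1]  # intervals 10-14 up to 55-59
    for x, f in polygon_points(10, 5, ink_blot):
        print(f"plot a dot above {x} at height {f}")
```

Connecting the dots in order with straight lines gives the polygon; the first and last dots, at 7 and 62, bring it down to the base line.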

The Histogram and How to Plot It.—Many of the facts learned in
plotting the frequency polygon also apply in plotting the histogram. 
The choice of size, proportions, units per square of graph paper all are 
the same. The only important difference is that although we locate the 
height of each column or rectangle by placing a dot at the midpoint of 
each interval, we do not then connect dot to dot with straight diagonal 
lines. Instead, we draw a short horizontal line through each dot (see 
Fig. 3.3), extending it to the upper and lower exact limits of each class
interval. Those exact limits are given in Table 3.3 for our data. Having 
done this, we erect vertical lines at each of these exact limits tall enough 
to form complete rectangles. Again it may be noticed that the rectangles 
seem to be misplaced a half unit with respect to the numbers on the base 
line, but this is correct; the choice of limits for our classes makes the 
exact limits come a half unit below the multiples of 5, i.e., at 4.5, 9.5, 
14.5, 19.5, etc.
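The rectangles of the histogram may likewise be listed by their exact limits and heights. This is a sketch for integer scores and equal intervals; the name `histogram_bars` is hypothetical.

```python
def histogram_bars(lowest_score, size, frequencies):
    """Return (left exact limit, right exact limit, height) for each interval."""
    bars = []
    for i, f in enumerate(frequencies):
        left = lowest_score + i * size - 0.5  # exact lower limit of the interval
        bars.append((left, left + size, f))
    return bars


if __name__ == "__main__":
    ink_blot = [2, 8, 6, 12, 7, 6, 4, 3, 1, 1]  # intervals 10-14 up to 55-59
    bars = histogram_bars(10, 5, ink_blot)
    for left, right, height in bars:
        print(f"rectangle from {left} to {right}, height {height}")
    print("total area:", sum((right - left) * h for left, right, h in bars))
```

Each rectangle begins where the one before it ends, and the total area is the interval size times N, so that every individual occupies the same amount of area.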

Advantages and Disadvantages of the Two Types of Figure.—On the 
whole, the frequency polygon seems generally preferred to the histogram. 
For one thing, it gives a much better conception of the contour of the 
distribution; the transition from one interval to another is direct and 
probably describes the distribution more accurately. The histogram
gives a stepwise change from interval to interval, based upon the assumption
that the cases falling within each interval are evenly distributed over
the interval. The polygon gives the more correct impression that on both 
sides of the highest point (directly above the mode), the cases within an 
interval are more frequent on the side nearer the mode, except where 
there are inversions in the general trend (as between scores of 15 and 25 in 
Fig. 3.2). 

On the other hand, the histogram gives a more readily grasped repre- 
sentation of the number of cases within each class interval; each measure- 
ment or individual occupies exactly the same amount of area. One more 
advantage favoring the polygon is that when we wish to plot two distribu- 
tions overlapping on the same base line, as, for example, two different age 
groups or the two sexes, the histogram type gives a very confused picture, 
whereas the polygon type usually provides a clear comparison. 

Plotting Two or More Distributions When N Differs.—The comparison 
of two distributions graphically raises a new question when the numbers 
of individuals in the two groups differ. With large differences, naturally, 
there is the question of scale, or how much space to give the figure. If 
the smaller distribution is large enough to be clearly legible, the larger 
one may extend beyond reasonable bounds. Furthermore, if it is general 
shapes and general positions on the measuring scale and dispersions that
we wish to compare, the marked difference in size may make such comparisons
very unsatisfactory. A common solution to this difficulty is to
reduce both distributions to percentage frequencies instead of plotting the
original frequencies. It is then as if we had two distributions, each of
whose N’s equal 100. This makes their two areas approximately equal in
the polygon form, and comparisons of shape, level, and dispersion are then
quite satisfactory.

How to Find Percentage Frequencies.—As an example of how to transform
frequencies into percentages the data in Table 3.4 are presented.
In each case, the frequencies in the distribution are each multiplied by
100, then divided by N. A shorter procedure would be to find the
quotient 100/N to four or more decimal places, then multiply each
frequency in turn by this ratio. In distribution I, the ratio is 100/51, which
equals 1.9608, and in distribution II it is 100/160, which equals 0.6250.
Multiplying each frequency f₁ by 1.9608, we obtain the list of percentages
in column (4), and multiplying each frequency f₂ by 0.6250, we obtain the
list in column (5). Plotting these percentages above the corresponding
midpoints of class intervals, we obtain the distribution curves in Fig. 3.5.
Although it was apparent in Table 3.4 that the second group were higher
on the scale than the first and that there was still considerable
overlapping of scores between the two, these facts are more clearly brought
out in graphic form. Also much clearer is the somewhat narrower
dispersion in the second as compared with the first.

TABLE 3.4.—FREQUENCY DISTRIBUTIONS OF SCORES IN A COLLEGE-APTITUDE TEST FOR
FRESHMEN AT TWO DIFFERENT COLLEGES

  (1)       (2)    (3)     (4)     (5)
 Scores      f₁     f₂      P₁      P₂
140-149              8              5.0
130-139             32             20.0
120-129             48             30.0
110-119       1     29     2.0     18.1
100-109       0     18     0.0     11.2
 90- 99       3     14     5.9      8.8
 80- 89       5      5     9.8      3.1
 70- 79       6      5    11.8      3.1
 60- 69      14      0    27.5      0.0
 50- 59       7      1    13.7      0.6
 40- 49      11           21.6
 30- 39       4            7.8

Fig. 3.5.—Distributions of scores in an aptitude test in two colleges. Frequencies
have been reduced to percentages. (Vertical axis: percentages.)
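
The reduction to percentage frequencies can be sketched in a few lines of Python. This is an illustration only and not part of the original text: the function name percentage_frequencies is an assumption, the frequencies for distribution II are those of column (3) of Table 3.4, and those for distribution I are taken from column (2) of Table 3.5, which lists the same 51 cases.

```python
def percentage_frequencies(freqs):
    """Convert raw class frequencies to percentages of N."""
    n = sum(freqs)
    ratio = 100 / n  # the shortcut ratio 100/N described in the text
    return [f * ratio for f in freqs]


# Distribution I, classes 110-119 down to 30-39 (the 51 cases of Table 3.5).
f1 = [1, 0, 3, 5, 6, 14, 7, 11, 4]
# Distribution II, classes 140-149 down to 50-59 (Table 3.4, column 3).
f2 = [8, 32, 48, 29, 18, 14, 5, 5, 0, 1]

p1 = percentage_frequencies(f1)
p2 = percentage_frequencies(f2)

# The two ratios quoted in the text, to four decimal places.
print(round(100 / sum(f1), 4), round(100 / sum(f2), 4))
# Percentages for distribution II, shown to one decimal place.
print([f"{p:.1f}" for p in p2])
```

Because each list of percentages sums to 100, the two polygons plotted from them enclose approximately equal areas, which is what makes the comparison of shapes fair.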

Skewed Distributions.—In Fig. 3.5 the fact is more clear that the
first group bunches at the left in its own range and has relatively few 
high scores, whereas the second group bunches at the upper end of its 
range, with relatively few low scores. We describe the first distribution
as being positively skewed (pointed end toward the right or positive
direction) and the second distribution as being negatively skewed (pointed end
toward the left or negative direction). The greater irregularity of
contour in the first distribution is probably due to the small number of cases
originally in this group. The changing of the two distributions to the
percentage basis has not changed the contour, only the general vertical
size of the curves.

Comparison of Two Histograms.—The same two distributions as
illustrated in Fig. 3.5 may also be shown in the form of histograms. When
overlapping histograms become rather involved and confusing, writers
sometimes resort to the device shown in Fig. 3.6. In that illustration,
a mirror reflection is pictured for one of the distributions, but both are
drawn on the same horizontal scale. The frequency scale (in terms of
percentages here) is repeated, also in mirror reflection. The shading
of the rectangles is optional, but it has the virtue of making the entire
surface within each histogram stand out from the page.

Fig. 3.6.—Same distributions as represented in Fig. 3.5 shown in the form of two
histograms. (Horizontal axis: scores in a scholastic-aptitude test, 30 to 150;
vertical axis: percentages.)

Other Variations in Presenting Overlapping Curves.—The distribu- 
tions in Fig. 3.5 are clearly represented as shown in two overlapping 
polygons. There are certain instances in which such line drawings will 
not suffice. One of these is when the two distributions are so extensively 
overlapping that there is considerable crisscrossing of lines and only con- 
fusion would result unless something is done about it. Fig. 3.7 demon- 
strates such a situation and also how the matter is handled, namely, by 
showing the one polygon in a dotted line. By inspection one can readily
see to which group all parts of a polygon belong. The groups are identi-
fied, each with its type of line, by giving the code, in this instance, in the 
upper right part of the chart. Figure 3.7 also includes desirable informa- 
tion such as is lacking in Fig. 3.5, namely, the total number of individuals 
in each sample. 

Figure 3.8 gives another demonstration of overlapping distributions 
that call for several different kinds of lines. This is generally desirable 
when there are more than two polygons on the same chart and when 
there is any overlapping at all. 


Fig. 3.7.—Two overlapping frequency polygons representing distributions of years of
schooling completed by samples of aviation students in the AAF. (Legend: examined in
Sept. 1942, N = 5027; examined in Sept. 1943, N = 3348. Horizontal axis: last year of
schooling completed, high school through college; vertical axis: percentage frequency.)


Figures 3.7 and 3.8, particularly, demonstrate how much meaning one
can extract from pictorial representations of frequency distributions.
Questions of policy governing the selection and training of aviation
students during World War II hinged upon questions of age and of
formal education of recruits, and it was important to maintain a clear
picture of changing status of the trainees in these respects. From Fig. 3.7,
for example, one would conclude that the typical recruit was a high-school
graduate and that men of this category comprised more than half of all
recruits. It might have been surprising to some of the commanding
officers to find that there were recruits with as little formal schooling as
8 years who could pass the Army Air Forces qualifying examination.
Those with less than 12 years of school were in very small percentages,
however, and either this type of man did not apply in large numbers for
aircrew training or he was screened out quite generally by the qualifying
examination. The fact that the two curves, for samples a year apart,
are almost identical throughout indicates that the same kind of men, so
far as previous education was concerned, were applying and qualifying
for admission to AAF flying training.

The distributions of aircrew recruits as to chronological age (Fig. 3.8)
tell quite a different story. Within the same period of a year, although
the same range of ages prevailed (it was limited by regulations), there
was a drastic trend toward reduction of age. This is shown by the fact
that the mode (age having the greatest frequency) was at twenty-three
years in the September, 1942, sample, at twenty-one years in the March,
1943, sample, and at nineteen in the September, 1943, sample. The
skewing was slightly negative in the earliest sample and markedly positive in
the latest sample. In one of the samples there was a secondary mode at
twenty-seven years. This reflects the known fact that many
twenty-seven-year-old men expedited their entrance into AAF flight training in
order to assure acceptance before reaching the age limit.

Fig. 3.8.—Three overlapping frequency polygons representing distributions of
chronological ages of aviation students in the AAF. (Legend: examined Sept. 1942,
N = 3205; examined Mar. 1943, N = 4500; examined Sept. 1943, N = 3347. Horizontal
axis: age to the nearest birthday; vertical axis: percentage frequency.)

Smoothing a Frequency-distribution Curve.—Any set of measurements
like those in Fig. 3.5 is usually regarded as one sample out of a larger
population having practically the same properties as the ones obtained in
the sample. The first group is one of freshmen entering a certain college
in a given year. If it is assumed that over a run of years the kind of
students seeking entrance and the kind accepted remain about the same,
the 51 students whose scores are given here may be said to represent the
larger population. Had we obtained similar scores for this larger
population, the irregularities seen in Fig. 3.5 would no doubt have been
minimized.

We frequently wish to forecast, from the supposed representative sample
that we have, how a larger population would distribute itself. To do this,
we smooth the frequency distribution in the following manner. We
predict from the frequencies we have what the corresponding frequencies
would be in the larger population by a system of running averages. In
this process, we permit the two frequencies on either side (i.e., in the
immediately neighboring intervals) to help determine the expected
frequency in any class. In Table 3.5, the obtained frequencies fₒ are given
in column (2), and it will be noticed that two class intervals have been
added at the ends of the range of scores.

TABLE 3.5.—ORIGINAL AND SMOOTHED FREQUENCIES FOR A DISTRIBUTION OF SCORES
IN A SCHOLASTIC-APTITUDE TEST

   (1)        (2)      (3)
  Scores       fₒ       fₑ
 120-129        0      0.25
 110-119        1      0.50
 100-109        0      1.00
  90- 99        3      2.75
  80- 89        5      4.75
  70- 79        6      7.75
  60- 69       14     10.25
  50- 59        7      9.75
  40- 49       11      8.25
  30- 39        4      4.75
  20- 29        0      1.00

 Sums.......   51     51.00
Running Averages of Frequencies.—As a first illustration of the
running-average method, let us apply it to finding the expected frequency fₑ in the
interval 70-79. The obtained frequency here is 6. We average this
along with the two immediately neighboring frequencies, 5 and 14. But
we allow the middle frequency to carry twice as much weight; so we add it
twice: 5 + 6 + 6 + 14 = 31. We have added four numbers; so we
divide by 4, obtaining 31/4 = 7.75. This is our predicted frequency for
the interval 70-79. Doing the same for the interval 40-49, we have
7 + 11 + 11 + 4 = 33. Divided by 4, this becomes 8.25. For the
interval 30-39, we have 11 + 4 + 4 + 0, all divided by 4, which gives us
4.75. If we wish to do so, we may even estimate frequencies in the
end classes given, for example, in the interval 20-29. Here we have
4 + 0 + 0 + 0 = 4, and divided by 4 the outcome is 1.00. All the
expected frequencies for this distribution are given in column (3) of Table
3.5. Their sum is equal to 51, which is a rough check upon the accuracy
of computation.

Fig. 3.9.—A smoothed distribution curve for the scholastic-aptitude scores in Table 3.5.
The circlets represent obtained frequencies. Dots represent new (smoothed)
frequencies obtained by the use of running averages. (Horizontal axis: scores, 10 to
140; vertical axis: frequencies.)
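
The running-average rule just described (the middle frequency counted twice, the two neighbors once, the total divided by 4) can be sketched as follows. The function name smooth_frequencies is an assumption made for illustration; the data are the obtained frequencies of Table 3.5, with intervals beyond the table treated as having zero frequency.

```python
def smooth_frequencies(freqs):
    """Weighted running average with weights 1, 2, 1 (divided by 4)."""
    padded = [0] + list(freqs) + [0]  # intervals beyond the table count as zero
    return [
        (padded[i - 1] + 2 * padded[i] + padded[i + 1]) / 4
        for i in range(1, len(padded) - 1)
    ]


# Obtained frequencies, classes 120-129 down to 20-29 (Table 3.5, column 2).
obtained = [0, 1, 0, 3, 5, 6, 14, 7, 11, 4, 0]
smoothed = smooth_frequencies(obtained)

for f_o, f_e in zip(obtained, smoothed):
    print(f"{f_o:3d}  {f_e:6.2f}")
print("sum of smoothed frequencies:", sum(smoothed))
```

The sum of the smoothed frequencies equals N only because an empty class interval was added at each end of the range; without those added classes some of the frequency would be averaged off the ends of the table.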


Plotting a Smoothed Distribution.—The final step is to plot the smoothed
curve, which we have in Fig. 3.9. First the obtained frequencies are
plotted as circlets in their proper places. It is always well to show these
even though we do not draw the curve through them as before. The
expected frequencies are next plotted as points. We can probably see by
inspection that the smoothing could be improved upon. In drawing the
smoothed curve, we do not feel compelled necessarily to touch all the dots.
Being concerned with the general shape freed from probably accidental
fluctuations, we take the liberty of further smoothing by inspection and by
free-hand drawing. If there were too many irregularities, even in the
smoothed points, we could, of course, repeat the averaging process, but this
is usually not wise, because it tends to flatten the entire distribution too
much and should be avoided if possible. In the present instance, very
little further adjustment of frequencies was needed in order to produce the
smoothed and rounded contour seen in Fig. 3.9. We may expect with
some confidence that the larger population from which this group is drawn
will distribute more like the rounded curve than like the irregular one we
actually obtained.

A Semigraphic Report in Typewritten Form.¹—When making reports of
frequency distributions, in typewritten form particularly, when the
number of cases is not too small, a good form is to let a period stand for one
individual, a colon stand for two, and an x stand for five, as in Table 3.6.

TABLE 3.6.—A SEMIGRAPHIC REPORT OF TWO FREQUENCY DISTRIBUTIONS

Age at last    Number of men entering    Number of women
birthday       college                   entering college

31-35
26-30
25 x:
24 $
23 XK.
22
21
20
19 XXXXXXXXXXXXXKXXX: .
18 XXXXXXXXXXXXXXX:
17 XXXXXXXXXXXXX! 2
16 z
XXXXXXX:

When the frequencies are small numbers, the same plan gives a good
picture if we let an x or some other letter stand for each individual.

¹ I am indebted to H. M. Cox for being introduced to this convenient device, which he
attributes to F. S. Beers.

When Coarse Grouping Is Desirable.—It was indicated in an earlier
footnote that there are occasions when the rules given for size and number
of class intervals should be modified. In making a graphic representation
of data it is often desirable to reduce the number of class intervals
below 10, and to make the intervals correspondingly larger. Doing so will
often provide a much better picture.

In small samples (for this particular purpose we may define a small
sample as one with an N less than 100), with fine grouping, frequencies
are likely to be irregular. Sometimes the effect upon the curve is
to produce a “saw-tooth” contour. It is very probable that the
population distribution, if we had it, would be smooth and regular. Since we
usually want the sample distribution to reflect the general picture of the
population from which it came and which it is supposed to represent, we
would like to avoid those irregularities. One solution already offered is
that of smoothing the distribution curve. There are some who object to
smoothing as the remedy, and for them there is another possibility. In
general, curves will be more regular if grouping is coarser.

Another aspect to this problem is that the particular frequencies we 
obtain by grouping are strongly dependent upon the choice we make in 
starting each class interval. With the same size of class interval, we 
might derive quite a different-appearing frequency polygon simply by 
making our division points between classes at other places, particularly 
if the sample is small. One can readily demonstrate this by choosing an 
appropriate interval of 3, let us say, and by setting up three distributions, 
starting the lowest interval at 12, 13, and 14, respectively, when the lowest 
score is 14. By introducing coarser grouping, this phenomenon, too, tends
to be counteracted. 
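
The demonstration suggested above can be tried with a short Python sketch. The scores below are hypothetical values invented for illustration (they are not data from this book), and the function name class_frequencies is likewise an assumption. The same 15 scores are grouped with an interval of 3, starting the lowest interval at 12, 13, and 14 in turn.

```python
from collections import Counter


def class_frequencies(scores, width, start):
    """Return (lower limit, upper limit, frequency) for each class, lowest first.

    Assumes integer scores and a start no higher than the lowest score.
    """
    counts = Counter((s - start) // width for s in scores)
    n_classes = (max(scores) - start) // width + 1
    return [
        (start + k * width, start + (k + 1) * width - 1, counts[k])
        for k in range(n_classes)
    ]


# Hypothetical small sample; the lowest score is 14.
scores = [14, 15, 15, 17, 18, 18, 19, 20, 20, 20, 21, 23, 24, 24, 26]

tables = {start: class_frequencies(scores, 3, start) for start in (12, 13, 14)}
for start, table in tables.items():
    print("lowest interval starts at", start)
    for lower, upper, f in table:
        print(f"  {lower:2d}-{upper:2d}  {f}")
```

With these invented scores the three groupings give visibly different contours, and the modal class is less pronounced under some starting points than under others, although the underlying measurements are identical.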

Another consideration in this grouping problem is the position of the
mode, i.e., the point on the measurement scale corresponding to the highest
point on the frequency curve. As different sizes of interval are utilized,
and as different starting points for intervals are chosen, so the mode may
shift up or down on the measurement scale, even jumping from interval to
interval. Coarser grouping will also tend to stabilize the interval and the
value of the mode.

Based upon certain mathematical considerations which we cannot go 
into here, Kelley has proposed that the number of classes to be utilized in 
the graphic representation of a distribution should be determined roughly 
from the size of sample as shown in Table 3.7. 

From the information given in Table 3.7, one would be justified in using 
only 8 classes for the ink-blot test data, which have been used so exten- 
sively for illustrations in this chapter. This number of classes would 
mean a class interval of 6, which could, of course, be used, though it is not 
in the preferred list. An interval of 10, which is in the preferred list, 
would result in only 5 classes, which would be less than are called for in 
Table 3.7. Remember that the coarser grouping is called for, thus far, 
only for the purpose of graphic representation. The requirement of 10 
or more classes still holds for computations such as we meet in the chapter 
to follow. Since one is often faced with the need of both graphic and 
computational use of data, some kind of compromise is practically
desirable and defensible in many instances. The illustrative example is
probably such an instance. The 10 classes used for the ink-blot data yield 
a frequency polygon which is rather regular, with one notable inversion, 
and the same 10-class distribution will serve for the computations required. 


The reader will be reminded in the next chapters, however, that with fewer
than 12 classes, it is necessary to make certain corrections for “grouping
errors” when certain accurate computations are desired.

TABLE 3.7.—THE NUMBER OF CLASSES TO USE IN PREPARING FREQUENCY
DISTRIBUTIONS FOR GRAPHIC REPRESENTATION FOR DIFFERENT SIZES OF SAMPLE*

 Sample Size (N)    Number of Classes
      4-  5                 2
      6-  8                 3
      9- 14                 4
     15- 21                 5
     22- 32                 6
     33- 46                 7
     47- 64                 8
     65- 89                 9
     90-117                10
    118-153                11
    154-192                12
    193-255                13
    256-315                14

* From Kelley, T. L. Fundamentals of statistics. Cambridge: Harvard University Press, 1947.
P. 133. Reproduced by permission.
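
For convenience, Table 3.7 can be expressed as a small lookup, sketched here in Python. The function name kelley_classes is an assumption for illustration, and the lower limit of the second row is read as 6, since the first row ends at 5. The sketch merely reproduces the table and does not extend it beyond N = 315.

```python
import bisect

# Upper limit of each sample-size range in Table 3.7 and the matching number of classes.
UPPER_LIMITS = [5, 8, 14, 21, 32, 46, 64, 89, 117, 153, 192, 255, 315]
NUMBER_OF_CLASSES = [2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14]


def kelley_classes(n):
    """Number of classes Table 3.7 gives for a sample of size n (4 to 315)."""
    if n < 4 or n > UPPER_LIMITS[-1]:
        raise ValueError("Table 3.7 covers sample sizes from 4 to 315 only")
    # Index of the first range whose upper limit is at least n.
    return NUMBER_OF_CLASSES[bisect.bisect_left(UPPER_LIMITS, n)]


print(kelley_classes(50))  # sample size of the ink-blot data
print(kelley_classes(51))  # sample size of the aptitude scores in Table 3.5
```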


Exercises 
1. For each one of the following ranges of measurements, state your judgment of 
(1) the best size of class interval, (2) the score limits of the lowest class interval, (3) the 
exact limits of the same interval, and (4) its midpoint. 


a. 83 to 197.           b. 4 to 39.
c. 17 to 32.            d. 35 to 96.
e. 0 to 188.            f. -24 to +28.
g. 0.141 to 0.205.


2. Given the following list of scores in a “nervousness” test (Data 3A) and using 
a class interval of 5, set up a frequency distribution. In the first solution, begin the 
lowest class interval with a score of 35. List all exact limits of class intervals and 
also exact midpoints. In a second solution, start the lowest class interval with a 
score of 33. After finishing both solutions, write out a comparison of the two dis- 
tributions and defend the choice of the one as against the other. As a third solution, 
use an interval of 3, choosing your own starting places for the classes. Discuss the
relative merits of the third distribution as compared with the first two.

DATA 3A.—SCORES IN A NERVOUSNESS INVENTORY

3. Given the following list of scores, each of which is the percentage of 400 words
judged pleasant by an individual (Data 3B), set up a frequency distribution making
the wisest choice of class interval and class limits.


Data 3B.—AFFECTIVITY RATIOS 
(All have been rounded to the nearest whole number) 


43 | o2 | sa | 48 | 46 | 65 | 43] 48 | 52] 51 | 57 | 48 | 48 
38 | 42 | 44 | 46 | 43 | 35 42 | 45 | 44 | 46 | 40 | 40 
47 | 52 | 38 | 51 | 45 | 38 | 51 | 40 | 46 | 45 | 54 | 55 | 41 
so | so | 42 | 39 | 56 | 44] 43 | 47 | 51] 43 | 50 |] 34 | 40 
s3 | 42 | 31 | 44 | St | 43 | 48 | 41 | 43 | 48] 41 | 55 


4. Plot a frequency polygon and a histogram for Data 3C, Group I. State your 
conclusions about these data as revealed by your plotted distributions. 


DATA 3C.—DISTRIBUTIONS OF CHEMISTRY-APTITUDE SCORES IN TWO FRESHMAN
CHEMISTRY COURSES, I AND II

 Scores     Frequencies    Frequencies
            for Group I    for Group II
 90-94           4               2
 85-89          10               0
 80-84          14               0
 75-79          19               0
 70-74          32               2
 65-69          31               4
 60-64          40               5
 55-59          28              12
 50-54          29              13
 45-49          21              21
 40-44          18              21
 35-39          10              19
 30-34           6              20
 25-29           1              14
 20-24           3               1
 Sums....      266             134

5. Apply the smoothing process described in this chapter to Data 3C, Group I. Plot
a curve based upon the smoothed frequencies but show the original frequencies as points,
as was done in Fig. 3.9. In what respects has smoothing changed the picture of these
data?

6. Reduce distributions I and II (Data 3C) to percentage distributions, and plot
them on the same diagram. Make a descriptive comparison of the two distributions
as drawn.


CHAPTER 4 
MEASURES OF CENTRAL TENDENCY 


This chapter is about averages, of which there are several kinds. Three
of them, the arithmetic mean (or mean, for short), the median, and the
mode, will be explained here. Two others, the geometric mean and the
harmonic mean, being much less useful to students of psychology and
education, will be briefly mentioned.

An average is a number indicating the central tendency of a group of
observations or of individuals. To the question, “How good is a
sixth-grade class in arithmetic?” the most reliable and meaningful kind of
answer would be the mean or median in some acceptable test of
arithmetical achievement. To the question, “What is the weakest tone to
which this dog will respond?” the best kind of answer is to state the average
result from a number of trials. In either case a single score or a single
measurement of the threshold stimulus would be highly unreliable, for not
all measurements, even from repeated observations of the same thing,
have the same value. To answer those questions by reciting the long list
of individual measurements would be highly uneconomical in the reporting
and not very enlightening to the questioner.

The average, whether it be a mean, median, or mode, serves two
important purposes. First, it is a shorthand description of a mass of
quantitative data obtained from a sample. It is surely more meaningful and
economical to let one number stand for a group than to try to note and
remember all the particular numbers. An average is therefore descriptive
of a sample obtained at a particular time in a particular way. Second, it
also describes indirectly but with some accuracy the population from which
the sample was drawn. If the sample of sixth-grade children is
representative of all the sixth-grade children in the same school, in the same
city, or even in the same county, then the average of their scores tells us
much about the average that would be made by the population that they
represent, be it school-wide, city-wide, or county-wide. If we examine the
dog's hearing under a set of conditions that is characteristic of his general,
day-to-day existence, the sample average will be very close to one that we
could actually obtain by testing him day after day on many days. It is
only because sample averages are close estimates of larger population
averages that we can generalize beyond particular samples at all and make
predictions beyond the limits of a sample. This means considerable
economy of effort, but far more important than that, it makes possible all
scientific investigation. We rarely or never know the average of a
population; consequently we do not know by how much our obtained average
has missed it, but if our sampling has been done in the proper manner
we can estimate about how far we may have missed it, as will be shown in
Ch. 9. In the present chapter we will be concerned only with the methods
of computing averages from sample data.


THE ARITHMETIC MEAN


The Mean of Ungrouped Data.—Most readers already know that to
find the arithmetic mean (popularly called the average), we sum the
measurements and then divide by the number of measurements or cases.
In terms of a formula,

$$M = \frac{\sum X}{N} \qquad \text{(The arithmetic mean)} \tag{4.1}$$

where M = arithmetic mean.
      Σ = “the sum of.”
      X = each of the measurements or scores in turn.
      N = number of measurements or scores.

In a certain experiment to determine the lowest frequency of vibration of a 
sound wave that would yield a tone for a human observer, 10 trials were 
given, with the following results: 13, 17, 15, 11, 13, 14), 07, 135 sod) 1 
(cycles per second). The sum of these measurements is 132, and therefore 
the mean is 13.2 cycles per second. Note that in reporting a mean it is 
given in terms of the unit of measurement, which is specifically stated. A 
mean is never an abstract number; it is always a mean of something and is
always in terms of some unit of measurement.

As another example, the scores on the ink-blot test found in Table 3.1,
when summed, give ΣX equal to 1,480. The mean, with the use of
formula (4.1), is

$$M = \frac{\sum X}{N} = \frac{1{,}480}{50} = 29.60$$

The mean ink-blot score is 29.60 score units. In practice, it is quite
customary in reporting a mean to round to one more figure at the right
than the original measurements had; in this case, to keep one decimal
place where the original scores were whole numbers. We report the mean
as 29.6 score units.¹

¹ One could determine the number of accurate significant figures in a mean by
applying the rules in Ch. 2 at each step of the operations. Further consideration will be
given the question of the number of places to report in a mean after discussion of the
standard error of a mean in Ch. 9.

The Mean of Grouped Data.—When data come to us grouped or when
they are too lengthy for comfortable addition without the aid of a
calculating machine or when we are going to group them for other purposes
anyway, we find it more convenient to apply another formula for the mean:

$$M = \frac{\sum fX}{N} \qquad \text{(Arithmetic mean from grouped data)} \tag{4.2}$$

where the symbols N and Σ have the same meaning as before.
      X = midpoint of a class interval.
      f = number of cases within an interval.

The solution by way of this formula is illustrated in Table 4.1. Here we
have only as many different X values as there are class intervals instead of
perhaps as many as there are original measurements. Each class interval
has as its X value the midpoint of that interval. This assumes that the
midpoint of the interval correctly represents all the scores within that
interval. This will not be exactly true in many instances, but the
discrepancy is small in any case, and in computing the mean, most of the
discrepancies counterbalance others, so that the final result is essentially
correct.¹

TABLE 4.1.—COMPUTATION OF THE MEAN IN GROUPED DATA

  (1)        (2)         (3)     (4)
 Scores    Midpoint       f       fX
              X
 55-59        57           1       57
 50-54        52           1       52
 45-49        47           3      141
 40-44        42           4      168
 35-39        37           6      222
 30-34        32           7      224
 25-29        27          12      324
 20-24        22           6      132
 15-19        17           8      136
 10-14        12           2       24
 Sums.....                50    1,480
                           N      ΣfX

$$\text{Mean} = \frac{\sum fX}{N} = \frac{1{,}480}{50} = 29.60$$

In column (2) of Table 4.1, the midpoints of the intervals are given. We
must add each midpoint into our total as many times as there are cases
within the interval. This means finding for each interval the product f
times X, or fX. The fX products are listed in column (4). The sum of
the fX products (ΣfX) is equal to 1,480. Dividing this by N, we find the
mean to be 29.60, as it was for the same data ungrouped. As was indicated
before, we should not be surprised to find a minor discrepancy between
the means calculated from grouped and ungrouped data. It happened
here that the discrepancy was zero. We may also expect trivial
discrepancies in means when the same data are grouped differently, i.e., with
different size of class interval or with different starting points for intervals of
the same size.
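
The computation in Table 4.1 can be checked with a brief Python sketch of formula (4.2). The function name grouped_mean is an assumption for illustration; the midpoints and frequencies are those of columns (2) and (3) of the table.

```python
def grouped_mean(midpoints, freqs):
    """Arithmetic mean from grouped data: the sum of f times X, divided by N."""
    sum_fx = sum(f * x for f, x in zip(freqs, midpoints))
    n = sum(freqs)
    return sum_fx / n


# Table 4.1, classes 55-59 down to 10-14.
midpoints = [57, 52, 47, 42, 37, 32, 27, 22, 17, 12]
freqs = [1, 1, 3, 4, 6, 7, 12, 6, 8, 2]

fx_products = [f * x for f, x in zip(freqs, midpoints)]  # column (4)
mean = grouped_mean(midpoints, freqs)
print(sum(freqs), sum(fx_products), mean)
```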

The Mean Computed by the Short Method.—When the original
measurements are relatively large numbers, particularly when the midpoints
and the frequencies are large numbers, the method just described can well
give way to a short-cut procedure that saves pencil-and-paper work.
Even greater saving is appreciated when, as in the next chapter, a standard
deviation is also to be computed. This procedure requires the use of a
guessed average, which we call M′, and of new or “coded” values to
replace the midpoint values. The steps are illustrated in Table 4.2,
including the coding process. In this table it can be seen that many of the
actual midpoints would be four-place numbers; for example, the highest
interval has a midpoint of 154.5 (midway between 149.5 and 159.5), and
consequently the fX products would become also rather large. The code
values for the intervals, given in column (3), are called x′ and will now be
explained.

Choosing a Guessed Mean.—First we select a guessed mean. This may
be chosen anywhere, for its choice is arbitrary. In order to obtain the
greatest benefits from the short method, however, it is well to choose a
guessed mean rather near to the actual mean, or at any rate near
the center of the distribution. Several criteria guide us in this
choice. One is to place the guessed mean at the midpoint of the middle
class interval (if there is an even number of intervals, either of the two
middle ones is eligible). The distribution in Table 4.2 is distinctly skewed,
however, with the bulk of the cases at the lower part of the range; so the
mean we find will probably fall in an interval lower than the middle one.
Another criterion is to choose the interval containing the median (see
Table 4.3 for the method of finding the median). In this distribution, the
median falls within the interval for scores 80-89. This is farther from the
center than we would ordinarily go for the guessed mean. Another guide
is to choose an interval that has a large number of cases, in fact, the
largest number. Here such an interval is that for scores 90-99. As a
good compromise among all of these criteria, the interval labeled 90-99
seems best. We should actually come out with the same computed mean
no matter which interval we chose for the guessed mean; the choice is
dictated entirely by the desire to keep the numbers small so that “headwork”
can replace paper-and-pencil work as much as possible.

¹ A discussion of how “grouping errors” affect statistics will be found in the chapter
immediately following.

TABLE 4.2.—COMPUTATION OF THE MEAN IN GROUPED DATA BY USING THE SHORT
METHOD

   (1)       (2)     (3)     (4)
  Scores      f       x′      fx′
 150-159      2      +6      +12
 140-149      2      +5      +10
 130-139      4      +4      +16
 120-129      1      +3       +3
 110-119      5      +2      +10
 100-109      5      +1       +5
                             +56
  90- 99     12       0        0
  80- 89     10      -1      -10
  70- 79     12      -2      -24
  60- 69     10      -3      -30
  50- 59      1      -4       -4
                             -68
 Sums....    64              -12
              N              Σfx′

$$c = i\left(\frac{\sum fx'}{N}\right) = 10\left(\frac{-12}{64}\right) = -1.88$$

$$M = M' + c = 94.5 + (-1.88) = 94.5 - 1.88 = 92.62$$

The Size of Class Interval Becomes the Temporary Working Unit.—Having
chosen the interval 90-99, we guess the mean to be at the midpoint of this
group, the midpoint being 94.5 (midway between 89.5 and 99.5). The
score point of 94.5 becomes the temporary zero point for our measuring
scale. In column (3), a zero is written in line with the interval whose
midpoint is 94.5. The first interval above is given a value of +1; the second,
+2; the third, +3; etc. The first interval below is given a value of -1;
the second, -2; etc. These x′ values now represent the class intervals,
which are just one unit apart. The new unit is equivalent to 10 score
units, a fact that we shall have to remember later.

The Correction to Add to the Guessed Mean.—From here on, the steps are
similar to those taken in Table 4.1. Next we find the fx′ product for each
interval, taking great care to record algebraic signs. All products above the
guessed mean are positive, and all products below are negative. The sum
of the positive products is +56, and the sum of the negative products is
-68. The algebraic sum of the entire column is therefore 56 - 68,
which equals -12. The Σfx′ therefore equals -12. From this we can
find directly how far the actual mean is from our guessed mean. The
actual mean is equal to M′ plus a correction c, and this correction is given
by the formula

$$c = i\left(\frac{\sum fx'}{N}\right) \qquad \text{(Correction to add to the guessed mean)} \tag{4.3}$$

where i = size of the class interval.
      x′ = deviation of a class interval from the guessed mean in terms of
           i as the unit.
      f = frequency within a class interval.
      N = total number of measurements.

In this problem, i = 10, Σfx′ = -12, and N = 64. Therefore

$$c = 10\left(\frac{-12}{64}\right) = -1.88$$

Adding this correction to the guessed mean, we have

$$M' + c = 94.5 - 1.88 = 92.62$$

The mean is 92.62 score units, but we should report it merely as 92.6
score units.

A Summary of the Short Solution of the Mean.—The steps involved in the 
short method of computing the mean may be summarized as follows: 
Step 1. Set up the frequency distribution.

Step 2. Choose a guessed mean. This is the midpoint of the interval (1)
near the center of the distribution; or (2) containing the median or
mode or both; or (3) probably containing the actual mean.

Step 3. Assign to the class intervals new small integral values, starting
with zero at the interval containing the guessed mean, with positive
values above and negative values below. Call these new
values x’.

Step 4. Find the fx’ product for each interval, and record in a column.

Step 5. Sum the fx’ products algebraically. This is Σfx’.

Step 6. Divide the sum of the fx’ products by N.

Step 7. Multiply this quotient by i, the size of the class interval. This
gives the correction c.

Step 8. Add this correction algebraically to the guessed mean. This gives
the mean.


A single formula representing the preceding steps is

    $M = M' + i \left( \frac{\Sigma f x'}{N} \right)$    (Arithmetic mean from grouped and coded data) (4.4)

where the symbols are as previously defined.¹
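As a rough check on formula (4.4), the short method can be sketched in a few lines of Python. The function name `coded_mean` and the small frequency table are illustrative assumptions and are not the data of Table 4.2; only the totals M’ = 94.5, i = 10, Σfx’ = −12, and N = 64 are taken from the text.

```python
def coded_mean(midpoints, freqs, guessed_mean, interval):
    """Mean of grouped data by the short (coded) method, formula (4.4)."""
    n = sum(freqs)
    # x' = deviation of each interval from the guessed mean, in interval units
    coded = [round((m - guessed_mean) / interval) for m in midpoints]
    sum_fx = sum(f * x for f, x in zip(freqs, coded))
    return guessed_mean + interval * sum_fx / n


# Totals quoted in the text: M' = 94.5, i = 10, sum of fx' = -12, N = 64.
text_mean = 94.5 + 10 * (-12 / 64)

# Hypothetical distribution (an assumption for illustration only).
midpoints = [74.5, 84.5, 94.5, 104.5, 114.5]
freqs = [2, 5, 9, 6, 3]

short = coded_mean(midpoints, freqs, guessed_mean=94.5, interval=10)
other_guess = coded_mean(midpoints, freqs, guessed_mean=84.5, interval=10)
direct = sum(f * m for f, m in zip(freqs, midpoints)) / sum(freqs)

print(round(text_mean, 2), round(short, 2), round(other_guess, 2), round(direct, 2))
```

The two guessed means give the same result as the direct computation from midpoints, which is the point made earlier: the choice of guessed mean affects only the size of the numbers handled, not the answer.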


THE MEDIAN 


The median is defined as that point on the scale of measurement above 
which are exactly half the cases and below which are the other half. 
Note that it is defined as a point and not as a score or any particular meas- 
urement. If this conception is kept clearly in mind, many difficulties will 
be forestalled. Some textbooks on statistics give a different definition of 
median for ungrouped as compared with grouped data and recommend 
two different procedures for computing the median. Here we shall apply 
the same definition to both cases and be consistent in computation 
throughout. 

The Median from Grouped Data.—It is probably easier to grasp the 
process of computing a median in grouped data. For a first illustration,
consider Table 4.3. Here there are 28 cases; so the median is that number
of points on the measuring scale above which there are 14 cases and below
which there are 14. Counting frequencies from the bottom upward, we
find that 4 + 1 + 1 + 10 = 16 cases, or 2 more than we want. To make
14 cases, we need 8 out of the 10. The median lies somewhere within the 
interval 15-19, whose exact limits are 14.5 and 19.5. We assume for the 
sake of computation that the 10 cases within this interval are evenly 
spread over the distance from 14.5 to 19.5 (see Fig. 4.1). We must inter- 
polate within this range to find how far above 14.5 we need to go in order 
to include the 8 cases we need below the median. We must go 8/10 of 
the way, for 8 is the number we require, and 10 is the total number in the 


1 The solution by use of formula (4.4) will be better understood by those who follow 
the proofs offered in Appendix A and who apply those proofs here. 




TABLE 4.3.—COMPUTATION OF THE MEDIAN SIZE OF CLASS IN A CERTAIN SCHOOL,
WITH THE USE OF GROUPED DATA

Class size     f
40-44          1
35-39          0
30-34          3
25-29          5
20-24          3     12 = number of cases above the interval containing the median
15-19         10
10-14          1      6 = number of cases below the interval containing the median
 5- 9          1
 0- 4          4
N = 28

Mdn = 14.5 + (8/10 × 5) = 14.5 + 4.0 = 18.5
Mdn = 19.5 − (2/10 × 5) = 19.5 − 1.0 = 18.5

interval. The total distance is 5 units; so on the scale of measurement we
go 8/10 of 5, or exactly 4.0 units. Adding this 4.0 to the lower limit of
the class interval 14.5, we get 14.5 + 4.0 = 18.5 as the median.

We can check this by counting down from the top of the distribution
until we include N/2 of the cases; 14 in this problem. Starting at the
top, we find that 1 + 0 + 3 + 5 + 3 = 12. We need 2 more cases out of the
next group of 10. We must go 2/10 of the way below the upper limit of
the interval, that is, below 19.5. This means 2/10 of 5 or exactly 1.0
unit. The upper limit, 19.5 minus 1.0, gives us 18.5 for the median,
which checks with the one obtained by counting up from below. It is
well always to check the determination of a median in this manner, and
to do so involves very little work. If the two estimates do not agree
exactly, something is wrong.

Fig. 4.1.—How the 10 cases in the interval 14.5 to 19.5 are distributed.
Each case is assumed to occupy a tenth of the interval, or one-half of a
score unit. The eighth one extends up to the point 18.5, which is the
median.


TABLE 4.4.—COMPUTATION OF THE MEDIAN SCORE IN A SENTENCE-CONSTRUCTION
TEST AS GIVEN TO 37 MEN

Scores     f
37-38      1
35-36      2
33-34      0
31-32      1
29-30      0
27-28      6     15 = number of cases above interval containing the median
25-26      5
23-24      8
21-22      8     14 = number of cases below interval containing the median
19-20      5
17-18      1
N = 37     N/2 = 18.5

Mdn = 22.5 + (4.5/8 × 2) = 22.5 + 9/8 = 22.5 + 1.125 = 23.6
Mdn = 24.5 − (3.5/8 × 2) = 24.5 − 7/8 = 24.5 − 0.875 = 23.6


To take another example with grouped data, consider Table 4.4, where 
N is an odd number. Here N/2 is 18.5, but the principle of interpolating 
within an interval for the exact median is just the same. Counting up 
from below, we find that 1+ 5+ 8 = 14, which lacks 4.5 cases of 
including the lower half. In the next interval, we must go 4.5/8 of the 
way, or 4.5/8 times 2, which equals 9/8, or 1.125. Adding this many 
units to the lower limit of the interval (22.5), we have 23.625 as the 
median; or dropping all but one decimal place, we report the median as 
23.6 score units. Checking by counting down from the top, we find 15
cases above the point 24.5. Going 3.5/8 of the way down into the interval 
of 2 units, we find that we must deduct 0.875 from 24.5 to find the median. 
When rounded to one decimal place, the median is 23.6, as before. In 
terms of a formula, the interpolated median is found from below by 


    $\mathrm{Mdn} = l + \left( \frac{N/2 - F_b}{f_p} \right) i$    (Interpolation of a median from below) (4.5a)

where l = exact lower limit of class interval containing median.
      F_b = sum of all frequencies below l.
      f_p = frequency of the interval containing Mdn.
      N and i are defined as usual.

In terms of a similar formula, the median is found from above by

    $\mathrm{Mdn} = u - \left( \frac{N/2 - F_a}{f_p} \right) i$    (Interpolation of a median from above) (4.5b)

where u = exact upper limit of the interval containing the median.
      F_a = sum of all frequencies above u.

Other symbols are as defined previously.

A Summary of the Steps for Interpolating a Median.—The steps for com-
puting a median from grouped data may be summarized as follows:


Step 1. Find N/2, or half the number of cases in the distribution.

Step 2. Count up from below until the interval containing the median is
located.

Step 3. Determine how many cases are needed out of this interval to make
N/2 cases.

Step 4. Divide this number needed by the number of cases within the
interval.

Step 5. Multiply this by the size of class interval.

Step 6. Add this to the exact lower limit of the interval containing the
median.

Step 7. Check by adding down from the top to find to what point the
upper half of the cases extend in a manner analogous to that
described in Steps 2 to 5 inclusive.

Step 8. Deduct the number of score units found in Step 7 from the exact
upper limit of the interval containing the median.
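The eight steps can be sketched in Python as a single function that interpolates from below and from above, so that the two estimates serve as a check on each other. The function name `grouped_median` and its argument layout are assumptions made for this illustration; the frequencies are those of Tables 4.3 and 4.4.

```python
def grouped_median(lower_limits, freqs, interval):
    """Interpolated median from below and from above, formulas (4.5a), (4.5b).

    lower_limits: exact lower limits of the class intervals, lowest first.
    freqs: frequency within each interval, in the same order.
    """
    n = sum(freqs)
    half = n / 2
    below = 0  # cases below the current interval
    for lower, f in zip(lower_limits, freqs):
        if f > 0 and below + f >= half:
            above = n - below - f  # cases above the current interval
            from_below = lower + (half - below) / f * interval
            from_above = (lower + interval) - (half - above) / f * interval
            return from_below, from_above
        below += f
    raise ValueError("the distribution contains no cases")


# Table 4.3: class sizes 0-4, 5-9, ..., 40-44 (interval of 5).
limits_43 = [-0.5 + 5 * k for k in range(9)]
freqs_43 = [4, 1, 1, 10, 3, 5, 3, 0, 1]

# Table 4.4: scores 17-18, 19-20, ..., 37-38 (interval of 2).
limits_44 = [16.5 + 2 * k for k in range(11)]
freqs_44 = [1, 5, 8, 8, 5, 6, 0, 1, 0, 2, 1]

median_43 = grouped_median(limits_43, freqs_43, 5)
median_44 = grouped_median(limits_44, freqs_44, 2)
print(median_43, median_44)
```

If the two values returned for a distribution differ by more than rounding error, something is wrong in the frequencies or the limits, exactly as with the hand computation.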

Some Special Situations.—There are some instances in which things do 
not turn out just as they did in the two illustrative examples. 

When the Median Falls between Intervals.—If it should happen, in adding 
up cases from below, that half the cases take in all the cases in the last 
interval, the median is then the exact upper limit of that interval. In 
counting down from above, it would be found that all the cases in the 
interval just above this one would also be required to make N/2; so its 
exact bottom limit would be the median. This coincides with the exact 
upper limit of the interval below; thus, the median checks. As an example,
note the following fictitious data:

25-29 | 30-34 | 35-39 | 40-44 | 45-49 | 50-54 | 55-59


8 3 y 




like this, we do the reasonable thing of assigning the crude mode to the
dividing point between these intervals, which is 22.5. Unless the data 
are reasonably numerous, so that there is clearly an interval of highest 
frequency, we should not attempt to assign a modal value to the dis- 
tribution. For example, the 10 measurements of threshold for pitch 
present an unusual situation with the greatest frequency (four cases) 
of 11, which is at one end of the distribution. Following right behind 
is the measurement 13, with three cases. Here it would be rather mean- 
ingless to say that the mode is 11. 

Estimation of the Mode by Coarse Grouping.—In certain methods of estimating
the mode (such as the interpolation method described below) it is
frequently helpful to resort to coarser grouping (smaller number of class 
intervals) than usual. This results in larger frequencies within the classes. 
Larger frequencies are more stable in the sense that a change of one or two 
cases more or less would affect them relatively less than when frequencies 
are small. They would change relatively less, also, from sample to sample 
drawn from the same population in the same manner. Furthermore, dif- 
ferences between frequencies in different classes are likely to be greater so 
that there is less doubt as to which interval contains the mode. Following 
a recommendation made by Kelley,¹ the optimal conditions for estimating
the mode prevail when the numbers of classes are as given in Table 4.5. 


TABLE 4.5.—OPTIMAL NUMBERS OF CLASSES FOR ESTIMATING THE MODE FOR DIFFERENT
SIZES OF SAMPLE


Classes 


Interpolation of the Mode.—In the method given above for the estimation
of the crude mode, the midpoint of the interval with the largest frequency 
was chosen as the best value. This is adequate for smaller samples (less 
than 100 cases) and even for larger samples when the distribution is sym- 
metrical (not skewed). When distributions are noticeably skewed, how- 
ever, the midpoint value is not as accurate an estimate as we can make. 
With larger samples, and even with smaller ones that yield a regular con- 
tour, we can interpolate within the modal interval to obtain a more accu- 
rate estimate of the mode. The procedure is best explained by reference 
to Fig. 4.2. 

1 Kelley, T. L., Fundamentals of statistics. Cambridge: Harvard University Press,
1947. P. 259. In Table 4.5, the relation of the range of measurements to the standard
deviation, particularly in small samples, has been taken into account (see Table 5.8).



In that illustration only the intervals with the three highest frequencies 
are shown. The three highest frequencies should be neighboring if this 
procedure is used. Nor should this procedure be used if these three fre- 
quencies do not reflect a skewing in the same direction as that for the 
entire distribution. The three in this illustration imply negative skewing
for the total distribution.

The principle of this method is to fix the value of the mode within the
interval in proportion to the frequencies in intervals on either side of the
modal interval. Here the greater neighboring frequency is in the interval
above the modal one, consequently the mode is estimated toward the
upper limit of the modal one. In the interpolation process we assume that
the nearness of the mode to the lower or upper limit of the modal interval
should be proportional to the frequency in the interval neighboring upon
each limit. The modal point can be estimated by the formula

    $\mathrm{Mo} = l + \left( \frac{f_a}{f_a + f_b} \right) i$    (Interpolated mode) (4.6a)

where l = exact lower limit of modal interval.
      f_a = frequency in interval immediately above.
      f_b = frequency in interval immediately below.
      i = size of class interval.

Fig. 4.2.—Illustration of the estimation of a mode from the three frequencies nearest
to it. (Mo = 52.5.)




The computation can be checked by working from the upper limit, by the
formula

    $\mathrm{Mo} = u - \left( \frac{f_b}{f_a + f_b} \right) i$    (Another interpolation of the mode) (4.6b)

Applying these formulas to the data represented in Fig. 4.2, we have,
first,

    Mo = 49.5 + (18 / (18 + 12)) × 5
       = 49.5 + .6 × 5
       = 49.5 + 3.0
       = 52.5

Second, we have

    Mo = 54.5 − (12 / (18 + 12)) × 5
       = 54.5 − .4 × 5
       = 54.5 − 2.0
       = 52.5



It will be seen that the result is conservative and that it would require 
an unusually uneven balance between f_a and f_b to move the mode very far
from the midpoint of the modal interval. This is probably as it should be,
but it suggests that with f_a and f_b very similar there is little need for
applying the method.

If the contour of the distribution is irregular, a preliminary step to the 
use of this interpolation method would be to introduce some smoothing 
by the running-average procedure described in the preceding chapter. 
Enough regularity would then be introduced to make feasible the use of 
formulas (4.6a) and (4.6b).
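A short Python sketch may help to show that the two interpolation formulas are complements of one another. The function name `interpolated_mode` is an assumption for this illustration. The value f_a = 18 and the limits 49.5 and 54.5 come from the worked example; f_b = 12 is the value implied by the factor .6 in that example rather than one read directly from the figure.

```python
def interpolated_mode(lower, interval, f_above, f_below):
    """Mode interpolated within the modal interval, formulas (4.6a) and (4.6b)."""
    total = f_above + f_below
    upper = lower + interval
    from_lower = lower + (f_above / total) * interval   # formula (4.6a)
    from_upper = upper - (f_below / total) * interval   # formula (4.6b)
    return from_lower, from_upper


# Modal interval 50-54 (exact limits 49.5 and 54.5), interval size 5.
mode_low, mode_high = interpolated_mode(49.5, 5, f_above=18, f_below=12)
print(mode_low, mode_high)

# With equal neighbors the estimate stays at the midpoint of the interval.
mid_low, mid_high = interpolated_mode(49.5, 5, f_above=15, f_below=15)
print(mid_low, mid_high)
```

The second call illustrates the remark that the method is conservative: with equal neighboring frequencies the estimate is simply the midpoint, 52.0.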

The Mode Estimated from the Mean and Median.—Fortunately, 
because of certain mathematical relationships between the mode and the 
other two measures of central tendency, we can estimate the mode from 
them. A simple approximation formula is 


Mo = 3Mdn — 2M (Estimation of a mode from mean and median) (4.7) 


In other words, the mode equals three times the median minus two times 
the mean. 

Applying this formula, we can now estimate the mode of the distribu- 
tion in Table 4.2, in which we were unable to decide upon a crude mode. 
The median for this distribution is 88.5, and the mean is 92.62. Although
we rounded the mean to one decimal place in reporting it, in further
calculations with it, we do well to keep the second decimal place. Applying
formula (4.7), the computed mode equals

(3 × 88.5) − (2 × 92.62) = 265.5 − 185.24 = 80.26.


Rounded to one decimal place, the estimated mode is 80.3. Reference
to the distribution in Table 4.2 again will show that this point comes
about midway among the four high frequencies. Had we done a very
reasonable thing and placed the crude mode midway among these four
intervals, it would have been at 79.5, which is less than one unit from the
calculated mode.

The mean of the distribution in Table 4.3 is 19.14 and the median is
18.5. The calculated mode is (3 × 18.5) − (2 × 19.14), which equals
55.5 − 38.28, or 17.22. This is separated from the crude mode, which is
17.0, by a trivial amount. In the distribution in Table 4.4, the median is
23.6, and the mean is 24.52. From this information, the mode is estimated
as 21.8, which deviates from the crude mode only 0.7 unit. It may add
meaning to the computed mode to say that it is the point on the measuring
scale at which the smoothed distribution curve probably has its highest
point.
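The three estimates just given can be reproduced with a one-line function. This is only a sketch of formula (4.7); the function name `estimated_mode` is an assumption, and the means and medians are the rounded values quoted in the text.

```python
def estimated_mode(mean, median):
    """Approximate mode from mean and median, formula (4.7): Mo = 3Mdn - 2M."""
    return 3 * median - 2 * mean


mode_42 = estimated_mode(mean=92.62, median=88.5)   # Table 4.2
mode_43 = estimated_mode(mean=19.14, median=18.5)   # Table 4.3
mode_44 = estimated_mode(mean=24.52, median=23.6)   # Table 4.4

for label, value in (("4.2", mode_42), ("4.3", mode_43), ("4.4", mode_44)):
    print(f"Table {label}: estimated mode = {value:.2f}")
```

Because the mean is multiplied by 2 and the median by 3, rounding errors in either are magnified, which is one reason for carrying the second decimal place into this computation.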


WHEN TO EMPLOY THE MEAN, MEDIAN, AND MODE

Certain Advantages of the Mean.—The arithmetic mean is to be
preferred whenever possible because of several desirable properties. In
the first place, it is generally the most reliable or accurate of the three
measures of central tendency. By this we mean that from sample to
sample of the same population, the mean will ordinarily fluctuate less
widely. Another reason is that the mean is better suited to further
arithmetical computations. Deviations of single cases from the central
tendency are important information about any distribution. Much is
done with these deviations, as will be seen in the following chapter. It
will be found that we square those deviations, and this we are really
justified in doing only when the deviations are taken from the mean.

When distributions are reasonably symmetrical, we may almost always use
the mean and prefer it to the median and mode. On the other hand, [...]
particularly when [...] and when the mean would lead to erroneous [...]
distribution, in which other measures of central tendency [...].

A Comparison of the Mean with the Median.—An important property of
the mean is that it is sensitive to the size of extreme measurements when
they are not balanced by other extreme measurements on the other side
of the middle. In the following set of measurements, the mean is 9 and
the median is 9:

4, 5, 7, 9, 11, 13, 14


Now, if the 14 had been 23 instead of 14, the median would be unchanged,
but the mean would become about 10.3. There are still an equal number of cases
above and below 9. So far as the median is concerned, the 11, 13, and 14 
could have been 110, 130, and 140, and still the median would be 9. But 
in this rather unusual but not impossible event, the mean would become 
57.9, where formerly it was only 9. The conclusion to be drawn is that 
when, in a small sample particularly, there are any very extreme measure- 
ments not balanced by other extreme measurements in the other direction, 
the median is to be preferred to the mean. 
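The comparison can be verified with Python's standard `statistics` module. The three lists are the sets of measurements described above; the variable names are assumptions made for this sketch.

```python
from statistics import mean, median

original = [4, 5, 7, 9, 11, 13, 14]
one_extreme = [4, 5, 7, 9, 11, 13, 23]        # the 14 replaced by 23
three_extreme = [4, 5, 7, 9, 110, 130, 140]   # 11, 13, 14 replaced

for label, data in (
    ("original", original),
    ("one extreme value", one_extreme),
    ("three extreme values", three_extreme),
):
    print(f"{label}: mean = {mean(data):.1f}, median = {median(data)}")
```

The median stays at 9 in every case, while the mean moves from 9 to roughly 10.3 and then to roughly 57.9.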

Some Mathematical Properties of the Arithmetic Mean and the Median.—A
better appreciation of the nature of the mean and of the median may be
gained by noting some of their mathematical peculiarities. To illustrate,
let us use the data presented in Table 4.6. There six scores are given for 
six individuals. The mean of these scores is 6.0 and the median is 4.5. 


TABLE 4.6.—ILLUSTRATION OF CERTAIN PROPERTIES OF THE ARITHMETIC MEAN AND
THE MEDIAN

(1)       (2)      (3)          (4)          (5)              (6)
Person    Score    Deviations   Deviations   Deviations       Deviations
                   from the     from the     from the mean,   from the median,
                   mean         median       squared          squared
A          2       −4           −2.5         16                 6.25
B          3       −3           −1.5          9                 2.25
C          4       −2           −0.5          4                 0.25
D          5       −1           +0.5          1                 0.25
E          9       +3           +4.5          9                20.25
F         13       +7           +8.5         49                72.25
Sums      36        0           +9.0         88               101.50
Means      6.0      0.0         +1.5
Median     4.5


The first feature to be pointed out is that the mean is the center of gravity 
of the scores. In Fig. 4.3 we have the six scores represented on the 
measurement scale. Imagine that the six individuals are arranged in 
their proper places along this scale. Imagine that the scale itself is a rigid 
plank or bar. The six persons may be regarded as exactly the same in all 
respects except for their scores on this scale. Each “weighs” the same; 
his effect upon the tilting of the bar depends only upon his position upon it. 




If we wish to rest the bar upon a single fulcrum in such a position that the 
bar will be perfectly balanced, that position must coincide with the mean. 
The measurements in any sample are perfectly balanced about the arith- 
metic mean. 

Each individual in this small distribution carries an effective weight in 
proportion to his distance from the mean. In the parlance of the physi- 
cist, each person’s distance from the mean is called a moment. In sta- 
tistics, also, we often speak of moments in a similar sense. In column (3) 
of Table 4.6, each of the six moments for this small distribution is given. 
They are more commonly called deviations from the mean or simply devia- 
tions. The size of each deviation indicates how much effective weight the 
moment carries and its algebraic sign tells in what direction that weight is 
applied. The algebraic sum of these moments is zero, as it always is when 
the arithmetic mean and the deviations are correctly computed. This is 


Fig. 4.3.—Illustration of the positions of six cases with respect to the arithmetic mean
(6.0) and with respect to the median. If all cases carry equal weight, they are perfectly
balanced when the fulcrum is placed at the arithmetic mean.


simply another indication that the mean is a center of gravity, for the 
positive and negative moments about the mean are perfectly balanced. 
The arithmetic mean is the only value in a distribution from which the 
deviations always sum algebraically to zero. To show that the median 
does not qualify in this respect, let us find the deviations of the six scores 
from the median and sum them (see Table 4.6). The algebraic sum of the 
deviations from the median is 9.0. This means a net balance of 9 units 
on the plus side. A fulcrum placed at the point 4.5 on the scale would be 
seriously overbalanced toward the end with the high scores. This comes 
from the fact that in computing a median we ignore the distance of each 
case from the central value. If we want the bar to balance when the ful- 
crum is placed at the median value, we will have to rearrange the cases, 
treating all cases above the median as if they had the same value and all 
cases below the median as if they also had the same value and a value as 
far below the median as the above-median group was placed above it. 
Not only are the deviations from the mean balanced about it but they 
have another important property. If we square each deviation, we have 




the squared moments about the mean. The peculiarity of the mean is 
that the sum of the squared deviations about it is smaller than that for the 
squared deviations about any other value. In most of the following chap- 
ters we will be concerned with squared deviations from the mean. For 
the present, it is merely significant to point out that when squared devia- 
tions are considered the arithmetic mean is closest to the measurements of 
the sample as a whole. In Table 4.6 we can see that for this small sample
the sum of squared deviations is much smaller when the reference point is 
the mean than when it is the median, the two sums being 88 and 101.5. 
The reader may verify the fact that 88 is the smallest possible sum of 
squared deviations in this sample by arbitrarily choosing other values as 
possible points of central tendency. 
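The verification suggested to the reader can be carried out mechanically. The following Python sketch uses the six scores of Table 4.6; the helper names and the grid of trial values (0.0 to 15.0 in steps of 0.1) are assumptions chosen only for illustration.

```python
from statistics import mean, median

scores = [2, 3, 4, 5, 9, 13]  # persons A to F in Table 4.6


def sum_of_deviations(center):
    """Algebraic sum of deviations of the scores from a trial central value."""
    return sum(x - center for x in scores)


def sum_of_squared_deviations(center):
    """Sum of squared deviations of the scores from a trial central value."""
    return sum((x - center) ** 2 for x in scores)


m, mdn = mean(scores), median(scores)
print(sum_of_deviations(m), sum_of_deviations(mdn))
print(sum_of_squared_deviations(m), sum_of_squared_deviations(mdn))

# Try many other reference points; none gives a smaller sum of squares.
trial_values = [k / 10 for k in range(0, 151)]
best = min(trial_values, key=sum_of_squared_deviations)
print(best, sum_of_squared_deviations(best))
```

The deviations sum to zero about the mean and to +9.0 about the median, and the smallest sum of squares found on the grid is the 88 obtained at the mean.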

Central Tendencies in Skewed Distributions.—In skewed distributions, 
the mean is always pulled toward the skewed (pointed) end of the curve, 


Fig. 4.4.—Two skewed distributions, (A) skewed negatively and (B) skewed positively,
showing the relative positions of modes, medians, and means. Note that the mean is
displaced farther from the mode toward the skewed end of the distribution and that
the median is displaced about two-thirds as far.
[Panel A, left to right: mean, median, mode. Panel B, left to right: mode, median, mean.]


as Fig. 4.4 shows. The arithmetic mean, as the center of gravity of the 
distribution, is weighted toward the extreme values, as was demonstrated
above. The sum of the deviations on the one side of it equals the sum of 
the deviations on the other side. The median comes at a point that 
divides the area under the distribution curve into two equal parts. The 
number of scores on the one side of it equals the number of scores on the
other. The interpretations of mean and median should be made accord- 
ingly. For example, for the data on class size in Table 4.3, the median 
of 18.5 tells us that half of the classes had 19 or more students enrolled and 
half of them had 18 or less. The mean class size, which is 19.1, tells us 
that if all the enrolled students had been reapportioned so as to make all 
classes the same size, the enrollment in each class would have been 19.1, 
or 19, with a few students left over. 

When the Mean Is Misleading.—In some instances, to give the mean of
a distribution only is highly misleading; for example, in a study of class
size in a certain university, among 62 classes, there were 2 classes having
more than 200 students, and 2 having between 100 and 200 students, all
the remaining classes except 2 being smaller than 60. The average size of
the 62 classes was 34, but this was not very typical, because half of the
classes had 20 or less (the median was 20.5). The most typical size of class
would be given as the mode, which was 17 (crude mode). If our purpose 
happened to be to equalize the size of classes, assuming that this were 
practical, we could conclude that there would be 34 students per class. If 
we wanted to decide as a matter of educational policy whether or not there 
were too many small classes in general and if we had concluded beforehand 
that most teachers can successfully handle 30 students in a group, then the 
median would tell us, without knowing anything more about the distribu- 
tion, that there were entirely too many small classes. The mean would 
not have told us this, because it was higher than 30. If we were piloting
a visiting inspector about the buildings while classes were in session and
wished to prepare him for the most likely size of class he would find at
random, we should give him the mode, since this size is more likely to
occur than any other one size. If we were purchasing equipment to suit
classes of various sizes, we should adapt it, if necessary, most often to
classes of modal size, though in this case we should also want to know more
about the entire frequency distribution.

Mean and Median Often Both Reported.—In reporting upon central
tendencies of skewed distributions, it is usually well to state both the mean
and the median, since each tells its own story, and from the difference
between the two we can immediately infer in what direction the distribu-
tion is skewed and about how strongly. Although the mode is easily and
quickly determined and will often serve until better averages can be
computed, it should probably never be reported alone and need not be
reported with the other two averages except when it is meaningful to do
so. When a distribution is symmetrical about the mode, the three averages
will coincide, and so only one of them, preferably the mean, need be
reported, together with the fact that the distribution is symmetrical.

When the Median Is Especially Called For.—There are one or two kinds
of distribution in which the median is the only satisfactory average.

Distributions with Indeterminate Values.—There are some distributions
in which some of the extreme values are not accurately determined. We
know that they lie out beyond a certain point on the scale but we do not
know just how far. In certain work-limit tests, for example, some subjects
would work on for unusual lengths of time if permitted to do so.
Suppose that all those who work on a certain test up to 10 min. are
arbitrarily stopped. They are in the minority, so a median can be found.
Time spans up to 10 min. may be classified as usual into chosen class




intervals. From 10 min. up, we find the laggards grouped together. We 
do not know just how long they might have kept working had we let them 
continue. An arithmetic mean cannot be determined here, but median 
and mode can still be utilized. 
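A small Python sketch may make the point concrete. All of the working times below are hypothetical, as are the variable and function names; the two subjects stopped at 10 min. are given two different assumed finishing times to show what does and does not depend on the unknown values. For simplicity the median here is the middle value of the ungrouped times rather than an interpolated one.

```python
from statistics import mean, median

# Hypothetical working times in minutes for subjects who finished.
known_times = [3.5, 4.0, 4.5, 5.0, 5.5, 6.0, 6.5, 7.0, 8.0]
n_stopped = 2  # hypothetical number of subjects stopped at 10 min.


def summarize(assumed_time):
    """Median and mean if every stopped subject had needed `assumed_time`."""
    data = known_times + [assumed_time] * n_stopped
    return median(data), mean(data)


mdn_low, mean_low = summarize(10.0)    # assume they would have just finished
mdn_high, mean_high = summarize(30.0)  # assume they needed much longer
print(mdn_low, round(mean_low, 2))
print(mdn_high, round(mean_high, 2))
```

The median is the same under both assumptions because the stopped subjects are a minority lying entirely above it, whereas the mean shifts with whatever value is assumed for them.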

When Equality of Unit Is Uncertain.—In another instance, we are not 
sure that all the units of our measuring scale are equal. This is particu- 
larly true in the psychological scaling methods of rank order and of equal- 
appearing intervals. In the former case, a number of judges have placed 
several objects or persons in rank order for some quality. Though the 
ranks are numerically equidistant, the things ranked probably are not. 
When combining ranks for any one object, we do less violence to the 
measurement if we find a median rather than a mean. In the other 
instance, though objects are placed in piles or categories that seem equi- 
distant to the observer, again we are not sure that his categories are 
numerically equidistant, and the median is a safer statistic to compute. 
It is also true in this scaling method that distributions of judgments for 
objects very high or very low on the scale are skewed or even truncated 
because of the “end effect.” By the end effect, we mean that although 
some judges would like to place some stimuli above the highest pile or 
category (or below the lowest), they are not permitted to do so. Some 
objects or persons rated thus pile up in the end categories when some of 
these times they should have gone beyond the end. This fact will distort 
the arithmetic mean but will not influence the median so long as not more 
than half of all the judgments for an object fall in the end group. 

A Summary of When to Use the Three Averages.—In brief, the follow- 
ing rules will generally apply: 

1. Compute the arithmetic mean when 

a. The greatest reliability is wanted. It usually varies less from 
sample to sample drawn from the same population. 

b. Other computations, as finding measures of variability, are to 
follow. 

c. The distribution is symmetrical about the center, particularly 
when it is approximately normal. 

d. We wish to know the “center of gravity” of a sample. 


2. Compute the median when

a. There is not sufficient time to compute a mean.

b. Distributions are badly skewed. This includes the case in which
one or more extreme measurements are at one side of the distribution.

c. We are interested in whether cases fall within the upper or lower
halves of the distribution and not particularly in how far from
the central point.

d. An incomplete distribution is given.

e. There is uncertainty about the equality of the unit of measurement.

3. Compute the mode when

a. The quickest estimate of central tendency is wanted.

b. A rough estimate of central tendency will do.

c. We wish to know what is the most typical case.


MEANS IN SOME SPECIAL SITUATIONS 


The measures of central tendency described thus far will take care of the 
great majority of situations in which such statistics must be computed. 
There are some problems, which, though rare, require other treatment. 
Four of these will be briefly mentioned: means of arithmetic means, means 
of percentages (and proportions), geometric means, and harmonic means. 

Finding Means of Arithmetic Means.—When one has the means of 
several samples, presumably from the same population, on the same test 
or scale, he may want to know the over-all mean for the samples combined. 
At first thought, it might seem appropriate simply to average the several 
means just as one would average single observations. This would be 
proper procedure provided the samples are of the same size. If the N’s 
in the samples differ, however, the means are not equally reliable. In 


order to extract the best information about the central tendency of the
entire sample, we should weight each mean according to the number of
cases in the sample from which it was derived, for a mean's reliability is
in proportion to the size of sample. This procedure is equivalent to
pooling all the single measurements from the different samples and
computing a single over-all mean. We can accomplish the same end by
computing a weighted mean of the means which we already know. The
general formula for computing a weighted mean is

$${}_wM = \frac{\Sigma WX}{\Sigma W}$$ (A weighted arithmetic mean) (4.8)

where ${}_wM$ = weighted mean.
W = weight.
$\Sigma WX$ = sum of the values being averaged, each multiplied by its
appropriate weight.
$\Sigma W$ = the sum of the weights.

Table 4.7 illustrates the application of this formula. In the problem
represented there, four means differing considerably had been derived
from samples ranging from approximately 400 to approximately 2,700


TABLE 4.7.—COMPUTATION OF A MEAN OF ARITHMETIC MEANS, WITH AND WITHOUT
WEIGHTING THE SAMPLES*

(1)      (2)                    (3)                  (4)
Group    Number in the sample   Mean of the sample   Weighted mean
         N_i = W                M_i = X              N_i M_i = WX
A        15                     25.6                 384.0
B        27                     31.3                 845.1
C         9                     38.7                 348.3
D         4                     32.5                 130.0
Sums     55 = ΣW                128.1 = ΣX           1707.4 = ΣWX
Means                           32.0 = M             31.0 = wM

* The samples were of scores on a perceptual-speed test administered to aviation students and
other military personnel. The sizes of samples were approximately 100 times the values given.
Rounding was done to simplify the illustration. It probably did not affect the size of the weighted
mean materially. N_i is the number of cases in sample I, and M_i the mean of sample I.


cases each.¹ The unweighted mean of these four means would be 32.0,
whereas the weighted mean is 31.0. The latter is much more representa- 
tive of all the individuals in the combined sample. 
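
As a minimal sketch of formula (4.8), the figures of Table 4.7 can be run through a few lines of Python. The function name `weighted_mean` and the variable names are illustrative choices, not part of the text.

```python
def weighted_mean(values, weights):
    """Weighted arithmetic mean: sum(W * X) / sum(W), as in formula (4.8)."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    return sum(w * x for w, x in zip(weights, values)) / total_weight


# Table 4.7: the four sample means (X) and the rounded sample sizes (W).
means = [25.6, 31.3, 38.7, 32.5]
sizes = [15, 27, 9, 4]

unweighted = sum(means) / len(means)
weighted = weighted_mean(means, sizes)

print(round(unweighted, 1))  # 32.0
print(round(weighted, 1))    # 31.0
```

Weighting each mean by its N gives the same result as pooling all single measurements, which is why the weighted value is the one to carry into further computations.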

When the means to be averaged are very close together, as they will 
ordinarily be when samples are drawn from the same population and are 
not too small, and when the N’s do not vary much from sample to sample, 
the weighted and unweighted means will be very close together. In 
certain situations, then, the unweighted mean may be reported. But if 
the composite mean is to be used for further computations, in which case 
it should often be estimated to the second decimal place, weighting cer- 
tainly is called for. 

The Mean of Percentages or of Proportions.—The weighting procedure 
just described is even more important in determining the mean of a series 
of percentages or of proportions. Table 4.8 illustrates this point. The 
data in that table have to do with the percentage of pilot students elimi- 
nated in certain schools during one training period. Had the schools had 
the same enrollment, or even very nearly the same, the unweighted mean 
would suffice. Since the largest class is nearly four times as great as
the smallest, however, and since elimination rates vary from 3.3 to 27.2, 
there is a marked difference between weighted and unweighted means. 
If we wished to know the over-all elimination rate in order to make 
decisions for some administrative purpose, the unweighted mean would 


¹ The means are so different and samples are so large that it is highly unlikely that
the samples came from the same populations. They will serve to illustrate the
procedure nevertheless.


TABLE 4.8.—COMPUTATION OF AN AVERAGE PERCENTAGE*

(1)       (2)               (3)                  (4)
School    Number enrolled   Number eliminated    Per cent eliminated
          N_i               N_i P_i / 100        P_i
G         243               55                   22.6
H          63                7                   11.1
K         196               43                   21.9
L          61                2                    3.3
S         125               34                   27.2
Sums      688 = ΣN_i        141 = ΣN_i P_i/100   86.1 = ΣP_i
Means     137.6 = M_N                            17.2 = M_P†

* The data represent students enrolled in five AAF pilot schools selected to illustrate this
procedure.
† The weighted mean of the percentages equals 14,100/688 = 20.5. The value 17.2 is the
unweighted mean.
be misleading. Certainly, when the percentage or the proportion in a 
composite is wanted for further computations, the weighting procedure is 
essential, unless the sample N’s are exactly equal. 
In terms of a formula, the weighted mean of a percentage is 


$${}_wM_P = \frac{\Sigma N_i P_i}{\Sigma N_i}$$ (Mean of percentages where N's differ) (4.9)

where $N_i$ = number in each sample.
$P_i$ = percentage for each sample.
$\Sigma N_i P_i$ = sum of products of each percentage times its corresponding N.
$\Sigma N_i$ = sum of the sample N's.
A completely analogous formula applies to finding the weighted mean of 
proportions, in which case p is substituted for P. 
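
Formula (4.9) can be checked against Table 4.8 with a short sketch. The variable names are illustrative, and the percentages are recomputed here from the enrollment and elimination counts rather than copied from the rounded column (4).

```python
# Table 4.8: enrollment (N_i) and number eliminated in each of five schools.
enrolled = [243, 63, 196, 61, 125]
eliminated = [55, 7, 43, 2, 34]

# Per cent eliminated in each school (P_i).
percents = [100 * e / n for e, n in zip(eliminated, enrolled)]

# Unweighted mean of the five percentages.
unweighted = sum(percents) / len(percents)

# Weighted mean, formula (4.9): sum(N_i * P_i) / sum(N_i).
weighted = sum(n * p for n, p in zip(enrolled, percents)) / sum(enrolled)

print(round(unweighted, 1))  # 17.2
print(round(weighted, 1))    # 20.5
```

The weighted value is simply the over-all elimination rate, 141 eliminated out of 688 enrolled.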

The Geometric Mean.—The arithmetic mean of two numbers is found 
by adding them and dividing by two. The geometric mean of two numbers 
is found by multiplying the two numbers and then taking the square root. 
The arithmetic mean of 2 and 18 is 10.0. The geometric mean is 


$$\sqrt{2 \times 18} = \sqrt{36} = 6.0.$$

The geometric mean of three numbers is the cube root of their product; of
four numbers the fourth root of their product; and so on. In terms of a
general formula,

$$GM = \sqrt[N]{X_1 X_2 X_3 \cdots X_N}$$ (4.10)

where GM = geometric mean.
$X_1, X_2, \ldots, X_N$ = series of measurements.
N = number of measurements.


When there are more than two measurements to be averaged in this 
manner the computations become bothersome, unless we resort to the use 
of logarithms. The students of mathematics will recognize that if we take
logarithms of both sides of formula (4.10) we obtain the equation

$$\log GM = \frac{\Sigma(\log X)}{N}$$ (Logarithmic solution of geometric mean) (4.11)


In other words, the steps called for are as follows: 

Step 1. Convert each X into a corresponding log X, by using Table K, 
Appendix C. 

Step 2. Sum the log X values. 

Step 3. Divide this sum by N. This result is the logarithm of the geo- 
metric mean, as shown by formula (4.11). 

Step 4. Find the antilog of the value obtained in step 3. This is the 
geometric mean. 


These steps are illustrated in Table 4.9, which will be explained next. 


TABLE 4.9.—COMPUTATION OF A GEOMETRIC MEAN OF TONES MATCHED FOR LOUDNESS
TO A STANDARD TONE

(1)      (2)         (3)
Trial    Stimulus    Logarithm of the
         (R)         stimulus (log R)
1        14          1.1461
2         8          0.9031
3        22          1.3424
4         7          0.8451
5        10          1.0000
Sums     61          5.2367
Means    12.2        1.0473

Geometric mean (antilog of 1.0473) = 11.2
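
The four steps of the logarithmic solution, applied to the settings in Table 4.9, can be sketched as follows. The function name is illustrative, and `math.log10` stands in for the table of common logarithms in the Appendix.

```python
import math


def geometric_mean(measurements):
    """Geometric mean by the logarithmic solution, formula (4.11)."""
    if any(x <= 0 for x in measurements):
        raise ValueError("geometric mean requires all measurements > 0")
    # Steps 1 and 2: convert each X to log X and sum the logs.
    sum_of_logs = sum(math.log10(x) for x in measurements)
    # Step 3: divide by N to obtain log GM.
    log_gm = sum_of_logs / len(measurements)
    # Step 4: take the antilog.
    return 10 ** log_gm


settings = [14, 8, 22, 7, 10]  # stimulus values R from Table 4.9

arithmetic = sum(settings) / len(settings)
geometric = geometric_mean(settings)

print(round(arithmetic, 1))  # 12.2
print(round(geometric, 1))   # 11.2
```

In Python 3.8 and later, `statistics.geometric_mean` in the standard library performs the same computation.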


One of the instances in which the geometric mean applies in psychology 
is in the averaging of stimulus values in psychophysics, when those stimu- 
lus values are used to indicate psychological quantities rather than physi- 
cal quantities. The data in Table 4.9 are fictitious and were invented to 
illustrate a point. Let us suppose that an observer with very poor dis- 
criminative power was asked to control a sound-generating instrument so 


as to produce a sound matching in loudness a tone that he has just
previously heard. On five different trials the readings of his settings might be
as given in column (2) of Table 4.9. We want to find his average setting.
The arithmetic mean, as shown in column (2), would be 12.2 units.
According to what we know about psychophysical relationships this
would be incorrect. We are really interested in the mean of his sensory
responses, the loudness of the tones that he hears. We assume these to
lie on a psychological scale whereas the stimuli lie on a scale of physical
energy. Let a value on the psychological scale be called S and one on the
physical scale be called R. From Fechner's psychophysical law, the
relationship of S to R is usually stated in the equation S = C(log R). Strictly
speaking, the R values should be expressed as multiples of the stimulus
limen, but that need not concern us particularly here. We may assume
that the R values in column (2) are multiples of the threshold stimulus.
In this connection the reader may be reminded of the decibel scale for
loudness of sounds. The decibel-scale values are proportional to the
logarithm of the stimulus. Ten decibels represents a stimulus 10 times as
strong physically as the threshold stimulus; 20 decibels one 100 times as
strong; 30 decibels 1,000 times, and so on. The physical values increase
in a geometric series while the psychological values are assumed to
progress in a parallel arithmetical series.

To return to Table 4.9, the logarithms of R are found in column (3).
Their sum is 5.2367 and their mean is 1.0473. The antilog of this value is
11.2, which is the geometric mean. It will be seen that this value is 1.0
unit smaller than the arithmetic mean of the same stimulus values. We
would conclude that for this observer the stimulus that for him seems
most equivalent to the standard sound is one of 11.2 units.

When to Use the Geometric Mean.—Probably the most common use of
the geometric mean in psychology has already been illustrated, namely, in
psychophysics.¹ There are other places in which it may well be preferred,
for example, in many instances in which time measurements are used,
including reaction-time measurements. The need for a geometric mean
may be indicated when distributions are distinctly positively skewed. It
is best, however, to look for some rational basis, such as the existence of
geometric series, before deciding to compute this kind of mean. A
rate-of-growth measurement, for example, often involves a geometric series. An
important limitation to mention is that a geometric mean cannot be
computed when any measurement in the distribution is zero or negative.

¹ See Guilford, J. P., Psychometric methods. New York: McGraw-Hill, 1936.

Harmonic Mean.—Like the geometric mean, the harmonic mean is
needed because the measurements were not made on an appropriate scale.


A common application for it in psychology is in connection with “work-limit”
tests. In such tests the score is the amount of time required to
complete a fixed quantity of work. The frequency distribution of such 
scores is often positively skewed. Such tests, if given in the more usual 
form of “time-limit” tests, would yield scores in terms of units of work 
accomplished in fixed time. The frequency distributions of such scores 
are more commonly nearly symmetrical. If the ability or abilities meas- 
ured are assumed to be normally, or at least symmetrically, distributed in 
the population from which the sample came, then it is reasonable that the 
work score is a more representative one than the time score; representative 
in the sense that it spaces individuals better along a scale of equal units. 

The harmonic mean (HM) is defined as the reciprocal of the mean of the 
reciprocals of the measurements. The formula is 


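$$\frac{1}{HM} = \frac{1}{N}\,\Sigma\left(\frac{1}{X}\right)$$ (Equation defining a harmonic mean) (4.12)

where N and X are as usually defined.
Taking reciprocals of both sides, we have

$$HM = \frac{1}{\dfrac{1}{N}\,\Sigma\left(\dfrac{1}{X}\right)}$$ (Same as (4.12) in reciprocal form) (4.13)

$$HM = \frac{N}{\Sigma\left(\dfrac{1}{X}\right)}$$ (Computing formula for harmonic mean) (4.14)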


The computational steps are as follows: 


Step 1. Convert every measurement into its reciprocal. For the purposes 
of this method, at least three significant figures should be retained. 

Step 2. Sum the reciprocals. 

Step 3. Divide N by the sum of the reciprocals. 


The computation of HM is illustrated in Table 4.10. The scores are in 
terms of number of minutes required to complete a series of 180 reactions. 
They have been invented for the illustration. The reciprocals are found 
in column (3). Their sum is .2834 and their mean is .0567. The recip- 
rocal of this number, 17.6, is the harmonic mean. This is interpreted as 
the average amount of time required to complete the task. 

The whole procedure may be made more reasonable by showing that 
the conversion of each time score into a reciprocal is equivalent to con- 
verting it into a work score. In column (4) are shown the “rate” scores 
in terms of reactions per minute. Person A took 36 min. to complete 180 


TABLE 4.10.—COMPUTATION OF A HARMONIC MEAN OF WORK-LIMIT SCORES

(1)       (2)               (3)              (4)
Person    Time scores in    Reciprocals of   Rate scores in
          minutes*          time scores      reactions per minute
          (X)               (1/X)            (180/X)
A         36                0.0278            5
B         20                0.0500            9
C         18                0.0556           10
D         15                0.0667           12
E         12                0.0833           15
Sums      101               0.2834           51
Means     20.2              0.0567†          10.2‡

* Number of minutes required to complete 180 reactions.
† The reciprocal of .0567 is 17.6, the harmonic mean.
‡ When converted to a time score, this becomes 180/10.2 = 17.6.


reactions. His rate is 5 reactions per minute. And so on, for the others. 
A little study of the values in column (4) will show that each is 180 times 
the corresponding value in column (3). The arithmetic mean of the rate 
scores is 10.2 reactions per minute. This can be converted back to a time 
score by the ratio 180/10.2 or it might be left as a work-rate score. If 
it is converted, the result checks with the harmonic mean. 

As in the case of the geometric mean, the harmonic mean cannot be 


computed when any X is zero or negative. 
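
A brief sketch of computing formula (4.14), using the time scores of Table 4.10, shows that the harmonic mean of the time scores agrees with the mean rate score converted back to time. The function and variable names are illustrative only.

```python
def harmonic_mean(measurements):
    """Harmonic mean: N divided by the sum of reciprocals, formula (4.14)."""
    if any(x <= 0 for x in measurements):
        raise ValueError("harmonic mean requires all measurements > 0")
    return len(measurements) / sum(1 / x for x in measurements)


times = [36, 20, 18, 15, 12]  # minutes to complete 180 reactions (Table 4.10)

hm = harmonic_mean(times)

# The same result by way of rate scores (reactions per minute).
rates = [180 / t for t in times]
mean_rate = sum(rates) / len(rates)
time_from_mean_rate = 180 / mean_rate

print(round(hm, 1))                   # 17.6
print(round(mean_rate, 1))            # 10.2
print(round(time_from_mean_rate, 1))  # 17.6
```

The guard on non-positive values reflects the limitation stated above: a zero has no reciprocal, and negative values make the result meaningless as an average time.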


Exercises

DATA 4A.—SCORES IN AN ENGLISH-USAGE EXAMINATION

Scores    f
52-53      1
50-51      0
48-49      5
46-47     10
44-45      9
42-43     14
40-41      7
38-39      8
36-37      6
34-35      5
32-33      3
Sum       68

DATA 4B.—AFFECTIVITY SCORES
(Per cent of 400 words marked “pleasant”)

Scores    f
95-99     [illegible]
90-94     [illegible]
85-89     [illegible]
80-84     [illegible]
75-79      9
70-74      8
65-69      2
60-64     [illegible]
55-59     [illegible]
50-54     [illegible]
Sum       65




DATA 4C.—SCORES MADE BY GRADUATES AND ELIMINEES IN THE COMPLEX COORDINATION
TEST BY STUDENT PILOTS
DATA 4D.—SCORES IN AN ADJUSTMENT INVENTORY OBTAINED FROM ALCOHOLICS AND
NONALCOHOLICS OF BOTH SEXES*
Frequencies Frequencies 
Scores z 
Graduates | Eliminees Males Females 
| Scores 
95-99 : Alco- Notts Alco- ou 
20-94 1 holics alee holics alco: 
85-89 7 1 holics holics 
80-84 13 2 - 
75-79 37 6 66-71 1 
$ 60-65 6 3 
70-74 75 23 54-59 13 1 2 1 
65-69 189 34 48-53 13 1 10 
60-64 297 94 42-47 17 3 u 1 
55-59 406 144 
50-54 425 208 36-41 33 3 | 2 1 
30-35 32 2 8 8 
45-49 341 209 24-29 | 32 ooa e 
40-44 174 205 18-23 23 | 16 5 | 26 
35-39 81 105 12-17 24 | 36 2 | 40 
30-34 16 34 
25-29 5 15 6-11 7 | 43 2 | 49 
0-5 1 25 21 
20-24 2 
15-19 1

* Manson, M. P. A psychometric differentiation between alcoholics and non-alcoholics,
Quar. J. Stud. Alcohol, 1948, 9, 175-206.


1. Compute the arithmetic mean of any or all distributions in Data 4A to 4F
inclusive, using the method that seems most feasible. In Data 4E, you will need to make
some assumption about the cases in the two highest intervals. State your
assumptions if means are computed for these distributions.

2. Compute medians for any or all distributions in Data 4A to 4F inclusive. Why
is the difficulty experienced with computation of the mean in Data 4E not also
encountered in computing the median?

3. Give the crude modes for all distributions in Data 4A to 4F. Compute the
estimated mode in distributions for which you know both mean and median.


Data 4G.—Some UNGROUPED DATA 
. 8, 15, 13, 6, 10, 16, 7, 12, 11, 14, 9 
. 12, 10, 18, 13, 4, 8, 17, 15, 6, 14 

. 9, 8, 9, 15, 3, 9, 11, 9, 13 

12, 28, 19, 15, 15, 35, 14, 15 

s T, 18,20, 14, 27, 23, 13,3 




4. Compute and list the means, medians, and crude modes (where possible) for 
the distributions in Data 4G. 




DATA 4E.—AGES OF COLLEGE FRESHMEN

Age at last
birthday      Men    Women
31-35           1      2
26-30           3      6
25              7      6
24              6      7
23             11      7
22             20      6
21             23     16
20             40     13
19             88     48
18            117     67
17             69     57
16              2      6
Sums          387    241

DATA 4F.—AIMING-TEST SCORES
(In terms of average error in millimeters)

Score        Men    Women
8.0-8.4        1
7.5-7.9        5
7.0-7.4        2
6.5-6.9        7      2
6.0-6.4        6      4
5.5-5.9       11      3
5.0-5.4       10      9
4.5-4.9       16      7
4.0-4.4       18     15
3.5-3.9       19     12
3.0-3.4       17     15
2.5-2.9       17     13
2.0-2.4       14     14
1.5-1.9       13     10
1.0-1.4        8      1
0.5-0.9        1
Sums         165    105


5. For each distribution in Data 4G, tell to which measure of central tendency you 
give first preference and to which, second. Give reasons. 

6. For each distribution in Data 4A to 4F inclusive, tell which measure of central 
tendency you would prefer and which would be your second choice. Give reasons. 

7. Find the weighted mean of the four means: 15, 16, 18, and 21. These means were 
derived from samples in which the N’s were 6, 10, 25, and 20, respectively. Compute 
the unweighted arithmetic mean of the four, for comparison. Interpret your result. 

8. Find the weighted mean of the proportions .25, .30, .32, and .33. These
proportions were based upon samples whose N's are 44, 32, 18, and 25, respectively. Compute
an unweighted arithmetic mean of these proportions, for comparison. Interpret your
results.

9. Find the geometric mean of the numbers 2, 9, 15, and 16. Compute the
arithmetic mean for comparison. Interpret your results.

10. Find the harmonic mean of the work-limit scores 20, 25, 40, and 50. These
scores represent the total time summated in a series of 120 simple reaction times and
are in terms of seconds. Interpret your result.


CHAPTER 5 
MEASURES OF VARIABILITY 


Knowing the central tendency of a set of measurements tells us much,
but it does not by any means give us the total picture of the sample we
have measured. Two groups of six-year-old children may have the same
average IQ of 105, from which we would conclude that, taken as a whole,
each group is as bright as the other, and we might expect from the two the
same average level of performance in school or out of school in areas of life
where IQ is important. Yet when we are told, in addition, that one group
has no individuals with IQ's below 95 or above 115, whereas the other has
individuals with IQ's ranging from 75 to 135, we recognize immediately
that there is a decided difference between the two groups in variability or
dispersion of brightness. The first group is decidedly more homogeneous
with respect to IQ, and the second is decidedly more heterogeneous. We
should expect the first group to be much more teachable in that they will
grasp new ideas at about the same rate and progress at about the same
rate. We should expect the second group to show considerable disparity
in speed of grasping new ideas. There will be extreme laggards at the one
end of the distribution and others at the other end of the distribution who
may be irked at the slow progress of the group. The distributions for
two such groups, when plotted, resemble those in Fig. 5.1.

Fig. 5.1.—Two distributions with the same mean (IQ = 105) but with decidedly
different ranges (and dispersions).



It is the purpose of this chapter to explain and illustrate the methods of
indicating degree of variability or dispersion by the use of single numbers,
just as in the preceding chapter we saw how the central tendency of a
distribution could be indicated by a single number. The four most
customary values to indicate variability are (1) the total range, (2) the
semi-interquartile range Q, (3) the standard deviation σ, and (4) the average (or
mean) deviation AD.¹


THE TOTAL RANGE


The total range is the indicator of variability that is easiest and most
quickly ascertained but is also the most unreliable; thus it is almost
entirely limited to the purpose of preliminary inspection. In the
illustration of the preceding paragraph, the range of the first group (from an IQ
of 95 to an IQ of 115) was 21 IQ points inclusive. The range of the
second group was from 75 to 135 IQ points. The range is the distance given
by highest score minus lowest score, plus 1. From this comparison, we
draw the conclusion that the second group is considerably more variable
than the first.
Why the Range Is Unreliable.—The range is very unreliable for the
reason that only two measurements are used to determine it. The
remaining measurements have nothing to do with the estimation of it. In the
second group just mentioned, it might have been true that there were
several IQ's of 75 and also several IQ's of 135; but this would be most
unusual. The chances are great that there would be only one 75 and one
135. Furthermore, the next lowest IQ might have been 85, with a gap of
10 points to the very lowest; and the next to the highest might have been
120, a distance of 15 points from the very highest. Had either or both of
the persons with 75 IQ and 135 IQ been missing from the group, the range
would have been something very different from the 61 points actually
obtained. This is what we mean by saying that the total range is highly
unreliable. Some faith can, of course, be placed in it when there is more
than one case having each of the extreme measurements and when there
are no decided gaps in the tails of the distribution.

When Ranges Should Not Be Compared.—Total ranges should not be
compared when two distributions have a markedly different number of
cases. It is quite natural for more extreme cases to show up as we add
new cases to any sample, so that larger groups should be expected to have
wider total scatter. This factor is not nearly so important for other
indicators of dispersion as it is for total range. Another caution almost goes
without saying, and that is the impossibility of comparing ranges in two
distributions where the units of measurement are not the same.

¹ The probable error PE has been used as a measure of variability, but it seems rapidly
to be going out of use and so is merely mentioned in this volume (see the footnote on


THE SEMI-INTERQUARTILE RANGE—Q 


The semi-interquartile range, Q, is one-half the range of the middle 
50 per cent of the cases. First we find by interpolation the range of the 
middle 50 per cent, or interquartile range, then divide this range by 2. 
See Fig. 5.2 for a general picture of the relation of Q to a frequency 
distribution. 


Fig. 5.2.—Illustration of the quartiles Q₁, Q₂, and Q₃, the interquartile and
semi-interquartile ranges, and the quarters of the sample in a slightly skewed distribution.
(Lines are erected at the first quartile Q₁, at the second quartile Q₂, which is also the
median, and at the third quartile Q₃; they separate the lowest quarter, the two middle
quarters, and the highest quarter.)


Quartiles and Quarters.—When we count up from below to include the
lowest or first quarter of the cases, we find the point called the first quartile,
which is given the symbol Q₁. Counting down from above to include the
highest or fourth quarter of the cases, we locate the third quartile, or Q₃.
Incidentally, the median, which separates the second and third quarters
of the distribution, is also called Q₂. Note that the quartiles Q₁, Q₂, and
Q₃ are points on the measuring scale. They are division points between
the quarters. We may say of an individual that he is in the highest
quarter (or fourth quarter), and we may say of another that he is at the
third quartile. We should never say of an individual that he is in a certain
quartile.

Interpolation of Q₁ and Q₃.—In the distribution of ink-blot scores again,
we locate the third and first quartiles by interpolation (see Table 5.1).




TABLE 5.1.—DETERMINATION OF Q₃, Q₁, AND Q (THE SEMI-INTERQUARTILE RANGE)
FOR THE INK-BLOT TEST SCORES

Scores     f
55-59      1
50-54      1
45-49      3
40-44      4
35-39      6   <- Q₃ lies within this interval
30-34      7
25-29     12
20-24      6   <- Q₁ lies within this interval
15-19      8
10-14      2
N = 50

$$Q_1 = 19.5 + \frac{2.5}{6} \times 5 = 19.5 + 2.08 = 21.58$$

$$Q_3 = 39.5 - \frac{3.5}{6} \times 5 = 39.5 - 2.92 = 36.58$$


One-fourth of the cases (N/4) is 12.5. Counting up from the bottom to
include 12.5 cases, we find that we need 2.5 out of the 6 cases in the third
class interval. As in earlier solutions, 2.5/6 times 5 gives 2.08. Added
to 19.5, this gives 21.58 as the position of Q₁. Counting down from the
top, we find that we need 3.5 cases out of 6 in the fifth class interval.
Then 3.5/6 of 5 gives 2.92. Deducted from 39.5, this leaves 36.58 as our
estimate of Q₃.

The Interquartile Range and Q.—The interquartile range, or the
distance from Q₁ to Q₃, is given by Q₃ − Q₁, or 36.58 − 21.58, which
equals 15.00. The semi-interquartile range is one-half of this, or 7.5.
In terms of a formula,

$$Q = \frac{Q_3 - Q_1}{2}$$ (Semi-interquartile range) (5.1)

where Q₃ = third quartile.
Q₁ = first quartile.
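
The interpolation can be sketched in code, assuming integer scores so that each class interval extends half a unit beyond its score limits (for example, 20-24 runs from 19.5 to 24.5). The helper `grouped_quantile` is a hypothetical name, and it counts up from the bottom for both quartiles, which reaches the same Q₃ as counting down from the top.

```python
def grouped_quantile(intervals, fraction):
    """Interpolate the point below which `fraction` of the cases fall.

    `intervals` lists (lowest score, highest score, frequency) in
    ascending order; exact limits are taken as 0.5 beyond the scores.
    """
    n = sum(f for _, _, f in intervals)
    target = fraction * n          # number of cases to count off
    cumulative = 0
    for low, high, f in intervals:
        if f > 0 and cumulative + f >= target:
            width = high - low + 1
            needed = target - cumulative   # cases needed in this interval
            return (low - 0.5) + needed / f * width
        cumulative += f
    raise ValueError("fraction must lie between 0 and 1")


# Ink-blot scores of Table 5.1, lowest interval first.
ink_blot = [(10, 14, 2), (15, 19, 8), (20, 24, 6), (25, 29, 12), (30, 34, 7),
            (35, 39, 6), (40, 44, 4), (45, 49, 3), (50, 54, 1), (55, 59, 1)]

q1 = grouped_quantile(ink_blot, 0.25)
q3 = grouped_quantile(ink_blot, 0.75)
q = (q3 - q1) / 2                  # formula (5.1)

print(round(q1, 2), round(q3, 2), round(q, 2))  # 21.58 36.58 7.5
```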
How Quartiles Indicate Skewness.—It is of interest in passing to take
note of the relative distances of Q₃ and Q₁ from the median, or Q₂, in a
distribution. If the distribution is exactly symmetrical, both the third
and first quartiles will be the same distance from the median, and that
distance is Q. When there is any skewness in the distribution, the two
distances will be unequal. If the skewness is positive, the distance
Q₃ − Q₂ will be greater than the distance Q₂ − Q₁. If the skewness is
negative, the reverse will be true. In other words, skewness is:

positive when (Q₃ − Q₂) > (Q₂ − Q₁)
negative when (Q₃ − Q₂) < (Q₂ − Q₁)
and zero when (Q₃ − Q₂) = (Q₂ − Q₁)

The relative sizes of these two distances therefore tell much about the
direction and the amount of skewness in the distribution. For the
ink-blot scores, Q₃ − Q₂ is 8.4, and Q₂ − Q₁ is 6.6. Our inference is that the
distribution is positively skewed to a moderate degree. In Fig. 5.2 the
distribution is positively skewed and (Q₃ − Q₂) is greater than (Q₂ − Q₁).
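
The three-way rule can be written as a small function. This is only a sketch: the function name and the `tolerance` parameter are illustrative, and the median of 28.2 used below is an assumption inferred from the two distances quoted in the text, not a value stated in this section.

```python
def skew_direction(q1, q2, q3, tolerance=1e-9):
    """Classify skewness by comparing (Q3 - Q2) with (Q2 - Q1)."""
    upper = q3 - q2   # distance from the median up to the third quartile
    lower = q2 - q1   # distance from the first quartile up to the median
    if abs(upper - lower) <= tolerance:
        return "zero"
    return "positive" if upper > lower else "negative"


# Ink-blot scores: Q1 and Q3 from Table 5.1; the median 28.2 is assumed.
direction = skew_direction(21.58, 28.2, 36.58)
print(direction)  # positive
```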


THE AVERAGE DEVIATION 


The average deviation, or AD, is the arithmetic mean of all the deviations 
when we disregard the algebraic signs. Every score or measurement in a 
distribution deviates from the mean in that it is a certain distance above 
or below the mean. When and if any measurement coincides exactly 
with the mean, its deviation is zero. Deviations above the mean are 
regarded as positive distances; those below the mean as negative distances. 
In terms of an algebraic definition, 


$$x = X - M$$ (A deviation of a measurement from the mean) (5.2)

where X = an original score or measurement.
M = the arithmetic mean.

As was pointed out in a previous chapter, the deviations from the mean
may be regarded as moments about a center of gravity. If we sum the
deviations, taking into account the algebraic signs, the sum would be zero.
In other words, Σx = 0. The average of the deviations would also be
zero, because Σx/N = 0/N, and zero divided by any finite number is
equal to zero. This kind of an average of the deviations tells us nothing,
therefore, about their size. We want some indication of their over-all size
in order to describe the amount of dispersion. The greater the spread
of the deviations, the greater the dispersion of the distribution. One
solution is to disregard the algebraic signs of the deviations. In doing so,
we disregard their direction; we are interested only in their amount. We
treat them as if they were all positive. In terms of a formula,

$$AD = \frac{\Sigma|x|}{N}$$ (The average deviation) (5.3)


where |x| (with the vertical bars embracing it) = an absolute value of x,
i.e., disregarding algebraic sign.

TABLE 5.2.—CALCULATION OF THE AVERAGE DEVIATION IN UNGROUPED DATA
(Mean = 13.2)

X       |x|
13      0.2
17      3.8
15      1.8
11      2.2
13      0.2
11      2.2
17      3.8
13      0.2
11      2.2
11      2.2
Sum     18.8 = Σ|x|

$$AD = \frac{18.8}{10} = 1.88, \text{ or } 1.9$$


To illustrate the solution of an average deviation, consider Table 5.2.
The sum of the absolute deviations is 18.8. Divided by N, this gives 1.88
as the average deviation. Because of the small size of N, we should round
to one decimal place and give the AD as 1.9.

Interpretation of an Average Deviation.—From the formula and the
computations it will be seen that when we compute the average deviation
we are interested merely in the size of the deviations from the mean. We
ignore their direction. The AD is an arithmetic mean of all the deviations
of whatever size or direction. Like any arithmetic mean, it stands for all
the values averaged. In the problem just solved, the AD tells how much
on the average the different observations of the auditory limen differed
from their mean, 13.2. The answer is that on the average these deviations
were 1.9 cycles, or a little less than 2.
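
A minimal sketch of formula (5.3) with the ten observations of Table 5.2 follows; the function and variable names are illustrative.

```python
def average_deviation(measurements):
    """Mean of the absolute deviations from the arithmetic mean, formula (5.3)."""
    n = len(measurements)
    mean = sum(measurements) / n
    return sum(abs(x - mean) for x in measurements) / n


limens = [13, 17, 15, 11, 13, 11, 17, 13, 11, 11]  # Table 5.2

mean = sum(limens) / len(limens)
ad = average_deviation(limens)

# Observations lying within one AD of the mean.
within_one_ad = [x for x in limens if mean - ad <= x <= mean + ad]

print(round(mean, 1))      # 13.2
print(round(ad, 2))        # 1.88
print(len(within_one_ad))  # 4
```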

In samples that are not too small and when distributions approach the 
normal bell-shaped form, we may make the further remark that about 
58 per cent of the observations should be expected to fall within the limits 
1 AD below the mean and 1 AD above the mean. In the threshold prob- 
lem those two conditions are not satisfied; the distribution is neither large 
enough nor symmetrical enough to warrant such a conclusion. If this 
were the case, however, we could say that 58 per cent of the 10 measure- 
ments (6 of them) should be expected between 13.2 — 1.9 = 11.3 and 
13.2 + 1.9 = 15.1. This would include all integral values of 12, 13, 14, 




and 15. Actually, only four of the observations were included within 
those limits, though this should not surprise us, in view of the smallness of 
the sample. i 

Computation of the AD from Grouped Data.—Although the average
deviation is not often computed for large, regular samples in ordinary 
statistical practice, it is probably worth demonstrating how this statistic 
can be conveniently computed from data grouped in class intervals. 
Table 5.3 demonstrates this kind of solution. The mean of the 50 ink-blot 


TABLE 5.3.—COMPUTATION OF AN AVERAGE DEVIATION IN GROUPED DATA

(1)       (2)     (3)       (4)     (5)
Scores    X       x         f       fx
55-59     57      +27.4      1      + 27.4
50-54     52      +22.4      1      + 22.4
45-49     47      +17.4      3      + 52.2
40-44     42      +12.4      4      + 49.6
35-39     37      + 7.4      6      + 44.4
30-34     32      + 2.4      7      + 16.8
25-29     27      − 2.6     12      − 31.2
20-24     22      − 7.6      6      − 45.6
15-19     17      −12.6      8      −100.8
10-14     12      −17.6      2      − 35.2
Sums                        50      425.6
                            N       Σ|fx|


test scores represented in Table 5.3 was previously reported as 29.60. 
Ordinarily, one decimal place (or one digit beyond the last at the right 
in the original measurements) will do in the computation of the AD. 

Column (2) of Table 5.3 presents the midpoints of the intervals. The
midpoint value represents every measurement in the interval. Column
(3) gives the deviations of these midpoints from the computed mean.
Algebraic signs are recorded for the sake of accuracy but they will not be
needed in the computations. In column (5) are the products of each
frequency times its corresponding deviation, in other words, each fx
product. The equation for the AD by this procedure is

$$AD = \frac{\Sigma|fx|}{N}$$ (The average deviation from grouped data) (5.4)

where f, x, and N are as previously defined, and the fx products are
summed without regard to algebraic sign. From the data in Table 5.3,

$$AD = \frac{425.6}{50} = 8.512$$

which should be rounded to 8.5.¹
According to the kind of interpretation given previously, we may say 
that if this distribution of scores is close to normal, we should expect 58 
per cent of the scores to lie between 21.1 and 38.1. This would mean 29 
of the 50 scores. Since the data are grouped in Table 5.3, we cannot check 
this conclusion by actual count of the cases, but a rough check can never- 
theless be made. If we assume that the 6 individuals in the interval 35-39 
are evenly distributed, about 4 of them should be below 38.1. If we 
assume, likewise, that the 6 individuals in the interval 20-24 are evenly 
distributed, then 4 of them should be above the point 21.1. With these 
assumptions made, there are 27 cases between the points 21.1 and 38.1. 
This number is 54 per cent of the sample. Fifty-eight per cent would have 
called for 29. The agreement may be regarded as close enough, in view 
of the fact that the sample is not so very large and the fact that it tends 
to be positively skewed. Such a check is often sufficient to tell us whether
we have made any serious errors in computing the average deviation by
this method.
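
The grouped-data procedure of formula (5.4) can be sketched with the midpoints and frequencies of Table 5.3. Variable names are illustrative. The algebraic sum of the fx products is included because it should come out at approximately zero, which serves as a check on the arithmetic.

```python
# Table 5.3: interval midpoints (X) and frequencies (f), highest interval first.
midpoints = [57, 52, 47, 42, 37, 32, 27, 22, 17, 12]
frequencies = [1, 1, 3, 4, 6, 7, 12, 6, 8, 2]

n = sum(frequencies)
mean = sum(f * x for f, x in zip(frequencies, midpoints)) / n

# fx products: frequency times deviation of the midpoint from the mean.
fx = [f * (x - mean) for f, x in zip(frequencies, midpoints)]

algebraic_sum = sum(fx)             # check value, approximately zero
ad = sum(abs(p) for p in fx) / n    # formula (5.4)

print(n, round(mean, 2))            # 50 29.6
print(round(ad, 3))                 # 8.512
print(round(mean - ad, 1), round(mean + ad, 1))  # 21.1 38.1
```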
THE STANDARD DEVIATION


The standard deviation, or σ, is the most commonly used indicator of
degree of variability, and of the ones described in this chapter it is usually 
the most reliable. That is, it varies least from sample to sample drawn 
at random from the same population. It is therefore more dependable 
and, as an estimate of the dispersion of the population, it is more accurate. 

Computing the Standard Deviation Directly from Deviations.—Like
the AD, the standard deviation is also a kind of average of all the
deviations about the mean in a sample, though it is not a simple arithmetic
mean.² The fundamental formula for it is

$$\sigma = \sqrt{\frac{\Sigma x^2}{N}} \qquad \text{(Basic formula for the standard deviation in a sample)} \tag{5.5}$$

where x = deviation from the mean of the sample.
      N = the size of the sample.

¹ One check on the accuracy of computations of the fx values is to sum them
algebraically. The sum Σfx should equal approximately zero, small discrepancies due to
rounding being tolerated. In Table 5.3, Σfx equals exactly zero.

² In some books the standard deviation of a sample is symbolized by the double
[...]. In some others it is denoted by the letter s. The lack of agreement
[...]. The author believes that the symbols used here are most common
to psychological and educational literature and hence will be better understood by
readers in those areas.
Some Fundamental Concepts: Sum of Squares; Variance.—Formula (5.5)
deserves close study. It calls for several steps in fixed order:


Step 1. Find each deviation from the mean (x).

Step 2. Square each deviation, finding x².

Step 3. Sum the squared deviations, finding Σx².

Step 4. Divide this sum by N, finding Σx²/N.

Step 5. Extract the square root of the result of step 4. This is the
standard deviation.¹
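The five steps above can be sketched in a few lines of code. This is only an illustration, using the seven fictitious scores that appear in Table 5.4; the variable names are chosen here and are not part of the text.

```python
import math

# The seven fictitious scores of Table 5.4 (persons A to G).
scores = [15, 14, 11, 10, 9, 7, 4]

n = len(scores)
mean = sum(scores) / n

deviations = [x - mean for x in scores]   # Step 1: deviations from the mean
squared = [d ** 2 for d in deviations]    # Step 2: squared deviations
sum_of_squares = sum(squared)             # Step 3: sum of squares
variance = sum_of_squares / n             # Step 4: mean of the squares (V)
sd = math.sqrt(variance)                  # Step 5: square root

print(f"sum of squares = {sum_of_squares:.0f}")
print(f"variance = {variance:.2f}")
print(f"standard deviation = {sd:.2f}")
```

Note that the divisor is N, as in formula (5.5), so the result describes the sample itself.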


In verbal terms, a standard deviation is the square root of the arith- 
metic mean of the squared deviations of measurements from their mean. 
It has often been called the root-mean-square deviation. But in this sim- 
plified statement lies considerable meaning. Latent in the few steps 
enumerated above lie two statistical concepts that have increasing impor- 
tance. One is the sum of squares, the end result of step three. The 
other is called variance, the end result of step four. These ideas are best 
introduced by means of an illustration. 


TABLE 5.4.—DATA ILLUSTRATING SUM OF SQUARES, VARIANCE, AND STANDARD
DEVIATION


  (1)          (2)           (3)            (4)
Person        Score       Deviation      Deviation
                X             x           squared
                                             x²
  A            15            +5             25
  B            14            +4             16
  C            11            +1              1
  D            10             0              0
  E             9            -1              1
  F             7            -3              9
  G             4            -6             36
Sum            70 = ΣX        0 = Σx        88 = Σx²
Mean           10.0           0.0           12.57 = V
Standard deviation                           3.55 = σ


In Table 5.4 are listed seven fictitious scores representing a sample of
seven individuals A to G inclusive. These are denoted by the usual
symbol, X. The mean of these seven scores, as shown in column (2),
is exactly 10.0. Column (3) shows the deviations of these scores from
the mean. Their sum is zero and also their mean, as is to be expected.
In column (4) we find the squared deviations. Their sum, 88, is the
sum of squares. Their mean is equal to 12.57, which we have defined as
the variance, in this sample. The square root of this is 3.55, the standard
deviation. All this follows from formula (5.5) and from the steps and
definitions given above. Let us see what this means in terms of a
geometrical view of the problem.


¹ These steps are illustrated in Tables 5.4 and 5.5, and in Fig. 5.3.


[Figure 5.3. I: Deviations from the mean (scale -6 to +6). II: Variance and standard deviation (scale 0 to 4).]


Fig. 5.3.—Illustration of deviations from the arithmetic mean, their squares, their 
mean (which is the variance), and the standard deviation (which measures the vari- 
ability) in a sample of seven cases. 


For a geometrical representation of these ideas, see Fig. 5.3. In the
first diagram, the scale of measurement is shown, as usual, in the form
of a straight line extending from left to right. Here, however, the original
score values are not marked. The mean has become recognized as the
main reference point and has been called zero. This is what happens
when we derive deviations x from original scores X. All seven
individuals still retain their relative positions, in correct rank order and at
the same separations, as they had before. We have merely moved the
zero point 10 units up the linear scale.


So much for representing deviations. It will be seen that the points 
on the line correspond exactly with the values in column (3) of Table 5.4. 
Consider now the squaring of the deviations. Where deviations themselves
are represented by linear distances from a common reference point,
squared deviations must be represented by areas, namely, squares. The
squares belonging to the different individuals A to G are shown in Fig. 5.3.
The areas of the squares are equal numerically to the values given in 
column (4) of Table 5.4. It can be seen that the individuals come in the 
same rank order when we compare the squared deviations as when we 
compare x distances. It is also notable how large deviations, when 
squared, increase much more relatively than do small deviations. This 
point will be important to consider later. 

The sum of the squares would be represented geometrically as an area 
equal to a composite of all the squares in Fig. 5.3 I. This could also be 
shown as a square or as a rectangle. Its dimensions could vary some- 
what but its surface would contain 88 units such as those representing 
persons C and E. Finding the arithmetic mean of this large area is 
equivalent to apportioning it equally among the seven individuals. It is 
the amount of area that each person would possess if each one of them 
were given the same amount. This is the variance, which we may repre- 
sent in the form of a square in Fig. 5.3 II. This square is shown on a 
base line like that in the first diagram. Its length of side is the square 
root of its area and represents the standard deviation. 

Some important algebraic relationships, latent in formula (5.5), may be 
called to the attention of the reader. They are all important for general 
orientation in this topic. They may be useful not only in thinking about 
the concepts of sums of squares, variances, and standard deviations, but 
will be found to enter into computations of various kinds later. First, 
one more symbol needs to be introduced. V is often used to stand for 
variance. With this additional symbol given, we can state the following 
interrelationships: 


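$$\sigma = \sqrt{\frac{\Sigma x^2}{N}} = \sqrt{V} \tag{5.6}$$

$$V = \frac{\Sigma x^2}{N} = \sigma^2 \qquad \text{(Interrelationships of } \Sigma x^2,\ V, \text{ and } \sigma\text{)} \tag{5.7}$$

$$\Sigma x^2 = NV = N\sigma^2 \tag{5.8}$$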


Both V and σ, each in its own way, are indicators of amount of
dispersion in a distribution. V is said to measure variance, σ to measure
variability. When the sample is one of individuals measured on a common
scale, either V or σ can become familiar indicators of the extent of
the individual differences. To make these concepts more meaningful,
then, it is well to think of them in terms of measures of individual
differences.

Suppose, first, that we have a sample of only one case; with only one 
score. There is no possible basis for individual differences in such a 
sample, and therefore there is no variance or variability. Bring into the 
picture a second individual with his score in the same test or experiment. 
We now have one difference. Bring in a third case and we then have two 
additional differences; three altogether. Bring in a fourth, a fifth, and 
so on. There are as many differences as there are possible pairs of indi- 
viduals. We could compute all these interpair differences and could 
average them to get a single, representative value. We could also square 
them and then average them. It is far more economical, however, to find 
a mean of all the scores and to use that value as a common reference point. 
Each difference then becomes a deviation from that reference point and 
there are only as many deviations as there are individuals. Either the 
variance or the standard deviation is a single representative value for all 
the individual differences when taken from a common reference point. 
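The economy of working from a common reference point can be checked numerically. The sketch below is not taken from the text: it relies on a standard identity, namely that the mean squared difference over all distinct pairs equals 2V multiplied by N/(N - 1), and it reuses the seven scores of Table 5.4 as illustrative data.

```python
from itertools import combinations

# The seven fictitious scores of Table 5.4, reused here for illustration.
scores = [15, 14, 11, 10, 9, 7, 4]
n = len(scores)

mean = sum(scores) / n
variance = sum((x - mean) ** 2 for x in scores) / n  # V, formula (5.7)

# Every possible pair of individuals yields one difference.
pairs = list(combinations(scores, 2))
mean_sq_pair_diff = sum((a - b) ** 2 for a, b in pairs) / len(pairs)

print(f"number of pairs: {len(pairs)}")
print(f"mean squared pair difference: {mean_sq_pair_diff:.3f}")
print(f"2V * N/(N - 1):               {2 * variance * n / (n - 1):.3f}")
```

Seven individuals already give 21 pairs but only seven deviations, and the two routes carry the same information about dispersion.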

Consider the matter from a somewhat different point of view. Con- 
sider giving a certain test of n items to a group of persons. Before giving 
the first item to the group, so far as this test is concerned the individuals 
are all alike. All have scores of zero. There is no variance. This may 
seem absurd, but it has a very reasonable bearing on what comes next. 
Next administer the first item in the test to all individuals in the group. 
Some will pass it and some will fail. Some will now have scores of 1 and
some still have scores of zero. There are two groups of individuals.
There is this much differentiation; this much variance. Give a second
item. Of those who passed the first, some will pass the second and some
will fail it, unless the two items are perfectly correlated. Of those who
failed the first, some may pass the second and some may fail it. There
are now three possible scores, 0, 1, and 2. More variance has been
introduced. Carry the illustration further, adding item by item. The
differences among scores will keep increasing, and so, by computation, also the
variance and the variability, as indicated by V and by σ. Psychological
and educational testing depend almost entirely upon the phenomenon of
individual differences and therefore upon variance. Probably less than
one per cent of the tests commonly used yield scores on an absolute scale.
The significance of any score is ordinarily its usefulness in placement of a
person somewhere in the group. The greater the variance among the
scores, the more accurately (usually) each person is placed. Thus, in
addition to the use of the standard deviation in describing the spread or
scatter of a certain sample, there is its use, as we shall see in later chapters,
in the evaluation of tests and test items in a number of ways (see Ch. 17).
After this digression, let us return to the descriptive use of σ and its
computation in a typical laboratory problem.

As an illustrative problem in computing σ by formula (5.5), let us take
the 10 measurements of the threshold for pitch (see Table 5.5). Their
mean we found to be 13.2. The deviations from the mean are given in
column (2) and their squares, in column (3). Their sum is 51.60. The
mean of the squared deviations is 5.160. The standard deviation is the
square root of this, or 2.27. This should not be reported to more than
one decimal place. In terms of the unit of our measuring scale, this is
2.3 cycles per second.


TABLE 5.5.—CALCULATION OF THE STANDARD DEVIATION IN UNGROUPED DATA


  (1)          (2)          (3)
   X            x            x²
 Scores     Deviations
   13         -0.2           .04
   17         +3.8         14.44
   15         +1.8          3.24
   11         -2.2          4.84
   13         -0.2           .04
   17         +3.8         14.44
   13         -0.2           .04
   11         -2.2          4.84
   11         -2.2          4.84
   11         -2.2          4.84
                           51.60
                            Σx²

$$\sigma = \sqrt{\frac{\Sigma x^2}{N}} = \sqrt{\frac{51.60}{10}} = 2.27, \text{ or } 2.3$$

The Interpretation of a Standard Deviation.—Now that we have the
answer 2.3 cycles per second, how shall we interpret it? The usual and
most accepted interpretation is in terms of the percentage of cases included
within the range from one standard deviation below the mean to one
standard deviation above the mean. This range on the scale of
measurement includes about two-thirds of the cases in the distribution. In a
normal distribution, it is known that from -1σ (one standard deviation
below the mean) to +1σ (one standard deviation above), very nearly 68.27
per cent of the cases are found. Since most samples yield distributions
that depart to some degree from normality, we say, “about two-thirds,”
which is, of course, a little short of 68.27 per cent. Fig. 5.4 illustrates
the division of the area under a normal curve into regions marked off at
-1σ and +1σ. With two-thirds of the surface within those limits, there
is left one-third of the area to be divided between the two “tails” of the
distribution: one-sixth below the point at -1σ and one-sixth above the
point at +1σ.
In the problem just solved, where we found σ equal to 2.3, the distance
from -1σ to +1σ on the scale of measurement is 10.9 to 15.5 cycles; i.e.,
the mean 13.2 minus 2.3 is 10.9, and the mean plus 2.3 is 15.5 cycles.
Within these limits are all measurements of 11, 12, 13, 14, and 15. By
actual count, there are four 11’s, three 13’s, and one 15, or 8 of the 10
measurements within these limits, whereas we should have expected 7.
But, because of the small number of cases and the fact that the
distribution is irregular, we should not be surprised at this result. In other
problems this comparison serves as a rough check upon the accuracy of
computation of σ. It will not catch all errors but will indicate gross errors
if the sample is not too small and the distribution is fairly normal.


Fig. 5.4.—Approximate fractions of the area under a normal distribution curve
(consequently, fractions of the N cases in a normally distributed sample) that lie within
one standard deviation of the mean and also beyond the limits of one standard
deviation, in either direction.


Grouping Deviations as a Short Cut.—Some saving in time and effort
can be afforded in the solution of the standard deviation in data like those
in Table 5.5, if we group them as in Table 5.6. Since the same
measurement is repeated several times and its deviation from the mean is the
same every time, and also its deviation squared, we need to find the
deviation and its square only once and multiply each x² by its frequency.
The last column of Table 5.6 contains the fx² products, and it will be
seen that their sum is again 51.60, from which the standard deviation will
be the same as before. The formula for this reads

$$\sigma = \sqrt{\frac{\Sigma fx^2}{N}} \qquad \text{(Standard deviation from grouped data)} \tag{5.9}$$

where the symbols are defined as before.




TABLE 5.6.—CALCULATION OF THE STANDARD DEVIATION IN GROUPED DATA WITH
THE USE OF ACTUAL DEVIATIONS


  (1)       (2)        (3)       (4)       (5)
   X         x          x²        f        fx²

   17      +3.8       14.44       2       28.88
   15      +1.8        3.24       1        3.24
   13      -0.2         .04       3         .12
   11      -2.2        4.84       4       19.36
                                          51.60
                                           Σfx²


A similar treatment may be given all grouped data, in which we let the 
midpoint of each interval be the X for all cases within the interval, and 
this X minus M gives us the deviation of all cases within the interval. 
From here on, the procedure is the same as that in Table 5.6. We shall 
not illustrate the steps by means of a special problem, for there are more 
efficient ways of dealing with grouped data. 
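The grouped procedure of Table 5.6, together with the rough two-thirds check described earlier, can be sketched as follows. The dictionary layout and variable names are illustrative choices, not part of the text; the frequencies are those of Table 5.6.

```python
import math

# Each distinct measurement X of Table 5.6 mapped to its frequency f.
table = {17: 2, 15: 1, 13: 3, 11: 4}

n = sum(table.values())
mean = sum(x * f for x, f in table.items()) / n

# Formula (5.9): each squared deviation is found once and weighted by f.
sum_fx2 = sum(f * (x - mean) ** 2 for x, f in table.items())
sd = math.sqrt(sum_fx2 / n)

# Rough check: count the cases between -1 sigma and +1 sigma.
lower, upper = mean - sd, mean + sd
inside = sum(f for x, f in table.items() if lower <= x <= upper)

print(f"sum of fx^2 = {sum_fx2:.2f}, sigma = {sd:.2f}")
print(f"{inside} of {n} cases lie between {lower:.1f} and {upper:.1f}")
```

With so few cases, 8 of 10 rather than about 7 is within the tolerance the text describes.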

The Standard Deviation by the Short Method.—The short method, 
which was employed in the preceding chapter to calculate a mean 
(Table 4.2), will now be extended in order to compute a standard devi- 
ation. The first steps are carried out exactly as previously to the point 
of finding the mean. The mean itself need not be known (since we are 
dealing with a guessed mean), but the correction is required, as will be 
seen in the following formula:¹

$$\sigma = i\sqrt{\frac{\Sigma fx'^2}{N} - \left(\frac{\Sigma fx'}{N}\right)^2} = i\sqrt{\frac{\Sigma fx'^2}{N} - c'^2} \qquad \text{(Standard deviation from grouped and coded data)} \tag{5.10}$$

where i = size of class interval.
      x' = deviation from the guessed mean in terms of the class interval
           as the temporary unit.
      c' = correction in the guessed mean, also in terms of the class
           interval as the unit.


For computational convenience the formula may be varied as follows:

$$\sigma = \frac{i}{N}\sqrt{N\Sigma fx'^2 - (\Sigma fx')^2} \qquad \text{(Alternate for formula 5.10)} \tag{5.11}$$


1 This formula should be better understood by the student who follows the proofs 
given in Appendix A. 




TABLE 5.7.—CALCULATION OF THE STANDARD DEVIATION USING THE SHORT METHOD
(GUESSED-MEAN PROCEDURE)

$$\sigma = i\sqrt{\frac{\Sigma fx'^2}{N} - c'^2} = 5\sqrt{4.6 - .2304} = 5\sqrt{4.3696} = 5 \times 2.09 = 10.45$$


The procedure is illustrated in Table 5.7, which is similar to Table 4.2
through column (4). For all class intervals, we need to know the fx'²
products, and these are given in column (5). In each row, the fx'²
product is found by multiplying the corresponding numbers in columns (3)
and (4); i.e., the first one, 25, is the product of 5 × 5; the second one is
the product of 4 × 4; and the third, the product of 3 × 9; etc. This is
because the product fx'² may be factored as (fx')x'. It is excellent
checking procedure to do the multiplying also by the product (f) × (x'²) for
each interval.

Next we sum the fx'² products to obtain Σfx'². In Table 5.7, this is 230.
To find c', we divide Σfx' by N. In this case, it is -24/50, which equals
-0.48. We need c'², which is 0.2304. Now, to apply formula (5.10),
we need next to divide Σfx'² by N, or 230/50, which equals 4.6. Deduct
c'² from this, or 4.6 - 0.2304, and we have 4.3696. The square root of
this is called for next, and this is 2.09. The last step is to multiply by i,
the size of the class interval; 2.09 × 5 equals 10.45, which is the standard
deviation we have been seeking.

We may now say that about two-thirds of the individuals should be
expected between the mean minus 10.45 and the mean plus 10.45. Since
the mean is 29.6, these limits are 19.2 and 40.0. Fortunately, for the
sake of checking on this conclusion, these limits are close to the division
points between class intervals (see Table 5.7). The four intervals included
within these limits have in them 31 cases altogether, which are 62 per cent
of the whole group. This is a little short of two-thirds but not
unreasonably so.¹
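The arithmetic of the short method can be retraced from the column totals alone. The sketch below uses only the sums quoted in the text (Σfx'² = 230, Σfx' = -24, N = 50, i = 5); the two function names are hypothetical labels for formulas (5.10) and (5.11).

```python
import math


def sd_short_method(sum_fx2: float, sum_fx: float, n: int, interval: float) -> float:
    """Formula (5.10): sigma = i * sqrt(sum(fx'^2)/N - c'^2), where c' = sum(fx')/N."""
    c = sum_fx / n
    return interval * math.sqrt(sum_fx2 / n - c ** 2)


def sd_short_method_alt(sum_fx2: float, sum_fx: float, n: int, interval: float) -> float:
    """Formula (5.11): sigma = (i/N) * sqrt(N * sum(fx'^2) - (sum(fx'))^2)."""
    return (interval / n) * math.sqrt(n * sum_fx2 - sum_fx ** 2)


# Column totals quoted in the text for Table 5.7.
sd_a = sd_short_method(230, -24, 50, 5)
sd_b = sd_short_method_alt(230, -24, 50, 5)

print(f"formula (5.10): {sd_a:.2f}")
print(f"formula (5.11): {sd_b:.2f}")
```

The two forms are algebraically identical; (5.11) merely postpones the division by N, which suits machine computation.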

Rough Checks for the Solution of the Standard Deviation.—The kind of
comparison just mentioned is a rough check for the correct solution of
the standard deviation. If the actual percentage of cases between +1σ
and -1σ deviates too far from 68 per cent, there is probably something
wrong with the calculation, and a recalculation is in order. This check
cannot often be satisfactorily applied with grouped data because the
frequencies from -1σ to +1σ cannot then be accurately determined.

Another rough check is to compare the standard deviation obtained
with the total range of measurements. In large samples (N = 500 or
more) the standard deviation is about one-sixth of the total range. Stated
in other terms, the total range is about 6 standard deviations. In smaller
samples, the ratio of range to standard deviation becomes smaller, as
indicated in Table 5.8.


TABLE 5.8.—RATIOS OF THE TOTAL RANGE TO THE STANDARD DEVIATION IN A
DISTRIBUTION FOR DIFFERENT VALUES OF N*


    N     Ratio        N     Ratio        N     Ratio
   [?]     2.3        [?]     4.3        [?]     5.9
    10     3.1         50     4.5        [?]     6.1
    15     3.5        100     5.0        [?]     6.3
    20     3.7        200     5.5        [?]     6.5


* Adapted from Snedecor, G. W. Statistical methods. P.85. Ames, Iowa: Collegiate, 1940. 


In the ink-blot data, since N = 50, we should expect the range to be
4.5 times the standard deviation. The standard deviation 10.45 times 
4.5 gives us an expected range of about 47 points. Actually the range 
was 46 points, which checks so closely as to give us confidence that our 
standard deviation is at least not grossly in error. 
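The range check just described amounts to a single multiplication and a comparison. The sketch below holds only the ratios of Table 5.8 whose N values are legible in this copy; the dictionary and function names are illustrative assumptions.

```python
# Ratios of total range to standard deviation for the N values that are
# legible in Table 5.8 (after Snedecor); other entries are omitted here.
RANGE_TO_SD = {10: 3.1, 15: 3.5, 20: 3.7, 50: 4.5, 100: 5.0}


def expected_range(sd: float, n: int) -> float:
    """Range to be expected for a sample of size n with standard deviation sd."""
    return RANGE_TO_SD[n] * sd


expected = expected_range(10.45, 50)  # ink-blot data from the text
observed = 46                         # actual range reported in the text
relative_gap = abs(observed - expected) / expected

print(f"expected range: {expected:.1f}")
print(f"observed range: {observed}")
print(f"relative gap:   {relative_gap:.1%}")
```

A gap of a few per cent is the kind of agreement the text treats as reassuring; a forgotten multiplication by i would show up as a gap several times larger.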


¹ The probable error of a distribution is computed directly from the standard
deviation by the formula PE = .6745σ. It is numerically about two-thirds as large as σ, as
suggested by the ratio .6745. One more multiplication is required; consequently it is
not quite so easily computed as the standard deviation. Its chief virtue is that in a
normal distribution, 50 per cent of the measurements lie between the points at -1PE
and +1PE from the mean. The writer has come to feel that this is not sufficient excuse
for the inclusion of one more statistic to the already lengthy list, particularly when
the actual middle 50 per cent of the measurements can be more certainly delimited by
the interval Q₃ - Q₁, the use of which does not assume normality of distribution.




It may seem strange that we use a less reliable statistic like range as a 
criterion of accuracy of a more reliable statistic like the standard devi- 
ation. The reasons are that (1) there can hardly be any error in com- 
puting such a simple thing as the range, whereas (2) there are chances of 
gross errors in calculating σ because of the many steps involved, for
example, failing to make the final step of multiplying by i.

A Summary of Steps for Computing the Standard Deviation.—The steps
necessary for the calculation of σ by the short method are as follows:


Step 1. Complete Steps 1 through 6 already listed for finding the mean
by the guessed-average route (see Table 4.2).

Step 2. Find for every class interval the fx'² product. The most efficient
way is to compute the product of x' times fx' for each interval.
These products will all be positive.

Step 3. Sum the fx'² products.

Step 4. Divide this sum by N, carrying to at least two decimal places.

Step 5. Find c'², to at least two decimal places.

Step 6. Deduct the number found in Step 5 from that found in Step 4.

Step 7. Find the square root of the number found in Step 6, keeping
two decimal places.¹

Step 8. Multiply this number by the size of the class interval. If N is
large, report two decimal places; if small, round to one decimal
place.

Step 9. Interpret the standard deviation in terms of the two-thirds
principle.
Step 10. Apply the rough check of comparing σ with the range and using
the ratios of Table 5.8.


¹ In this, and in the following steps, it is assumed that we are dealing with integral
measurements. If they are in terms of decimal fractions or multiples of 10 or 100, this
rule applies only after making the necessary allowance for the place of the decimal point.


The Standard Deviation from Original Measurements.—If the number
of measurements is not large, if the measurements themselves are small
numbers, particularly when a good calculating machine is available, the
best procedure for computing a standard deviation is by means of the
formula

$$\sigma = \frac{1}{N}\sqrt{N\Sigma X^2 - (\Sigma X)^2} \qquad \text{(Standard deviation computed without knowledge of deviations)} \tag{5.12}$$

in which the essential steps are:


Step 1. Square each score or measurement.

Step 2. Sum the squared measurements to give ΣX².

Step 3. Multiply ΣX² by N to give NΣX².

Step 4. Sum the X’s to find ΣX.

Step 5. Square the ΣX to find (ΣX)².

Step 6. Find the difference NΣX² - (ΣX)².

Step 7. Find the square root of the number found in Step 6.

Step 8. Divide the number found in Step 7 by N (or multiply it by 1/N).


On the calculating machine, the X’s and the X²’s can be accumulated
at the same time according to instructions provided with the machine. 
In tabular form, the solution of this kind is illustrated in Table 5.9. 


TABLE 5.9.—CALCULATION OF THE STANDARD DEVIATION FROM THE ORIGINAL
MEASUREMENTS AND UNGROUPED DATA


    X         X²

    13       169
    17       289
    15       225
    11       121
    13       169
    17       289
    11       121
    13       169
    11       121
    11       121

   132     1,794
    ΣX       ΣX²

$$\sigma = \tfrac{1}{10}\sqrt{10(1{,}794) - 132^2} = \tfrac{1}{10}\sqrt{17{,}940 - 17{,}424} = \tfrac{1}{10}\sqrt{516} = \frac{22.7}{10} = 2.27, \text{ or } 2.3$$
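The raw-score route of formula (5.12) needs no deviations at all, only two running totals. The loop below is a sketch that mimics accumulating ΣX and ΣX² in a single pass, much as a calculating machine would; the data are the ten measurements of Table 5.9 and the variable names are illustrative.

```python
import math

# The ten pitch-threshold measurements of Table 5.9.
measurements = [13, 17, 15, 11, 13, 17, 11, 13, 11, 11]

n = 0
sum_x = 0    # running total of X
sum_x2 = 0   # running total of X squared
for x in measurements:
    n += 1
    sum_x += x
    sum_x2 += x * x

# Formula (5.12): sigma = (1/N) * sqrt(N * sum(X^2) - (sum(X))^2)
radicand = n * sum_x2 - sum_x ** 2
sd = math.sqrt(radicand) / n

print(f"sum X = {sum_x}, sum X^2 = {sum_x2}")
print(f"N*sum(X^2) - (sum X)^2 = {radicand}")
print(f"sigma = {sd:.2f}")
```

Because the measurements are integers, the quantity under the radical stays exact until the final square root.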


Grouping Original Measurements.—If the scores are conveniently
grouped and their frequencies tabulated, as in Table 5.10, some saving
in work can be effected. The steps by which we arrive at ΣfX and ΣfX²
should now be easy to follow by analogy with the preceding solution.
Once those values are obtained, Steps 6 to 8 above can be followed to
arrive at σ. The formula for this procedure is

$$\sigma = \frac{1}{N}\sqrt{N\Sigma fX^2 - (\Sigma fX)^2} \qquad \text{(Same as formula 5.12, with grouped data)} \tag{5.13}$$


TABLE 5.10.—CALCULATION OF THE STANDARD DEVIATION FROM THE ORIGINAL
MEASUREMENTS, WITH GROUPING


Correction of the Standard Deviation for Coarse Grouping.—We are
now ready to see more clearly why the number of class intervals should
not be too small in grouping data or the class interval too large.
Reference was previously made (p. 61) to a “grouping error.” Let us see
what the grouping error is and how it affects the standard deviation.


[Figure 5.5. Labels: actual means of class values; midpoints of class intervals.]


Fig. 5.5.—Illustration of grouping errors resulting from letting the midpoint of each
class interval represent all cases within the interval rather than using the mean of the
values for that interval. The smaller the number of intervals, the greater the error.
This phenomenon is illustrated in Fig. 5.5. There, a distribution is 
drawn with only five intervals. Our computations with grouped data 
thus far have assumed that all the values within an interval may be 
given a class value corresponding to the midpoint of the interval. In 
coarse grouping the midpoint value is not a very exact representative one 
because the cases are not distributed evenly, or even symmetrically, 
within the interval. The only exception to this is the interval that may 
happen to straddle the mean, in which case the midpoint and the average 
of the cases in the class will coincide.
In other intervals, note that the frequencies are greater toward the 
limit on the side nearer the middle of the distribution. If we computed 
an actual mean of the cases within each interval, we should find it 
nearer the mean of the entire sample than the midpoint is. The differ- 
ence between the class mean and the midpoint of an interval is the group- 
ing error in that interval. Above the sample mean the grouping errors 
are ordinarily positive (midpoint greater than the class mean) and below
the sample mean the errors are ordinarily negative (midpoint less than
the class mean). The effect of the grouping errors upon the computation
of a mean is usually almost nil because they are fairly well balanced.
But their effect upon the average deviation, and especially upon the
standard deviation, is often large enough to be concerned about.
Grouping errors tend to enlarge the standard deviation and the coarser the
grouping the greater is this systematic error in σ.

Sheppard’s Correction.—When a correction in σ is necessary, Sheppard’s
formula, developed for this purpose, serves very well. When applied to a
known standard deviation, it reads

$$\sigma_c = \sqrt{\sigma^2 - \frac{i^2}{12}} \qquad \text{(Sheppard’s correction in } \sigma \text{ for coarse grouping)} \tag{5.14}$$

where σ_c = standard deviation corrected for errors of grouping.
      σ = uncorrected standard deviation computed from data grouped
          in class intervals.
      i = size of the class interval.
It is more convenient to make the correction before proceeding as far as
the final step in computing. When σ', the standard deviation in
class-interval units, or σ'², is known (as when formula 5.10 is being utilized),
we can go directly to the corrected σ by the equation

$$\sigma_c = i\sqrt{\sigma'^2 - .0833} \qquad \text{(Sheppard’s correction applied to } \sigma'\text{)} \tag{5.15}$$


To start the correction farther back in the operations, as in connection
with formula (5.13), we have

$$\sigma_c = \sqrt{\frac{\Sigma fX^2}{N} - \left(\frac{\Sigma fX}{N}\right)^2 - .0833\,i^2} \qquad \text{(Solution of } \sigma \text{ with Sheppard’s correction included)} \tag{5.16}$$


It has been stated that when the size of class interval, i, is equal to .49σ,
Sheppard’s correction amounts to only about one per cent. Such an
error could be tolerated unless very precise calculations are going to be
done with σ after it is computed. If an interval is about one-half σ (i.e.,
.49σ), as just stated, and if the sample is large, with a range of about 6
standard deviations, we would then have 12 class intervals. For large
samples, then, 12 class intervals is a minimum for accurate computation
of the standard deviation. If there are fewer than 12, for accurate work
we should apply Sheppard’s correction. Whether or not we apply this
correction, therefore, depends upon the size of sample, the number of
intervals, and the use we intend to make of σ.
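Sheppard’s correction and the one per cent rule can both be verified with a few lines. The sketch below applies formulas (5.14) and (5.15) to the ink-blot values quoted earlier (σ = 10.45, i = 5, σ'² = 4.3696); the function names are hypothetical.

```python
import math


def sheppard_corrected(sd: float, interval: float) -> float:
    """Formula (5.14): corrected sigma = sqrt(sigma^2 - i^2 / 12)."""
    return math.sqrt(sd ** 2 - interval ** 2 / 12)


def sheppard_corrected_coded(coded_variance: float, interval: float) -> float:
    """Formula (5.15): corrected sigma = i * sqrt(sigma'^2 - 1/12), with 1/12 = .0833."""
    return interval * math.sqrt(coded_variance - 1 / 12)


corrected = sheppard_corrected(10.45, 5)
corrected_from_coded = sheppard_corrected_coded(4.3696, 5)

# With i = .49 sigma the correction should shrink sigma by about one per cent.
shrinkage = 1 - sheppard_corrected(1.0, 0.49)

print(f"corrected sigma via (5.14): {corrected:.2f}")
print(f"corrected sigma via (5.15): {corrected_from_coded:.2f}")
print(f"shrinkage when i = .49 sigma: {shrinkage:.2%}")
```

The two routes agree to two decimals; the small remaining difference arises only because 10.45 is itself a rounded value.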

The Standard Deviation of Combined Distributions.—There are times
when we have two sample distributions, presumably from the same
population, or obtained under the same set of conditions, and we want to
combine them into a single distribution. We have already seen how the
mean of the combined distribution can be computed from the means of
the component distributions (Table 4.7). We can also compute the
standard deviation of the combined distribution from a knowledge of
their standard deviations, but in doing so we also need to use their means.


TABLE 5.11.—TWO SAMPLE DISTRIBUTIONS (A AND B) AND A COMBINATION OF THE TWO
(DISTRIBUTION T)


                          Frequencies
   X     Distribution A   Distribution B   Distribution T

   11           1                2                3
   10           3                4                7
    9           6                8               14
    8           9               12               21
    7          11               16               27
    6           9               20               29
    5           7               14               21
    4           3               12               15
    3           1                6                7
    2                            4                4
    1                            2                2

   N           50              100              150
   M            6.96             6.08             6.37
   Σx²        155.92           479.36           661.08
   σ²           3.1184           4.7936           4.4072
   σ            1.77             2.19             2.10


Two distributions and their combination are represented in Table 5.11,
also in Fig. 5.6. An examination of the figure, especially, will show that
we cannot simply average the two standard deviations of the component
distributions. A simple arithmetic mean of the two standard deviations
(1.77 and 2.19) would be 1.98, whereas the standard deviation of the total
distribution is 2.10. Even a weighted mean of the two standard
deviations would not do.

If the two samples had the same mean, then deviations of all cases from
the mean in either sample A or sample B would be identical with
deviations from the mean in composite T. If the two means are different at
all, this difference contributes to the dispersion of the total distribution.


Fig. 5.6.—Two sample distributions (A and B) on the same scale of measurement and a
distribution (T) of their combination. Also represented are the mean of the combined
sample (M_t), of each subsample (M_a and M_b) and of the deviations of the latter from M_t
(designated as d_a and d_b).


Let the distance between M_a (the mean of the A distribution) and M_t
(mean of the composite distribution) be called d_a, and the distance between
M_b and M_t, likewise, be called d_b. It has been shown that the sum of
squares for the total distribution can be computed by the equation¹


$$\Sigma x_t^2 = \Sigma x_a^2 + \Sigma x_b^2 + n_a d_a^2 + n_b d_b^2 \qquad \text{(Sum of squares of combined distribution)} \tag{5.17}$$

in which Σx²_t = sum of squares of deviations of all measures in the total
                 distribution from the mean of that distribution.
         Σx²_a = sum of squares of deviations of all measures in
                 distribution A from the mean of that distribution.
         Σx²_b = similar sum for distribution B.
         n_a = number of cases in sample A, and n_b the number in
               sample B.
         d_a and d_b are as defined above, i.e., d_a = M_a - M_t and
               d_b = M_b - M_t.


From the equation it can be seen that if d_a and d_b equal zero (which would
happen if M_a = M_b) then the last two terms drop out and we can say
that the total sum of squares in the composite is equal merely to the 
summation of the sums of squares in the component distributions. 


1 For proof of this, see Appendix A. 




Recalling that, in general, Σx² = Nσ² (see formula 5.8), we may write
equation (5.17) as

$$N\sigma_t^2 = n_a\sigma_a^2 + n_b\sigma_b^2 + n_a d_a^2 + n_b d_b^2 \tag{5.18}$$

Dividing both sides of this equation by N and collecting the terms,

$$\sigma_t^2 = \frac{1}{N}\left[n_a(\sigma_a^2 + d_a^2) + n_b(\sigma_b^2 + d_b^2)\right] \qquad \text{(Variance of combined distributions)} \tag{5.19}$$


Thus, we might say that the variance in the total sample is equal to a 
weighted average of the variances within the component distributions plus 
the weighted variances of the sample means around the composite mean. 
This formula can be extended to include any number of component 
samples. For each additional sample brought into the combination, 
there would be another expression like $n_a(\sigma_a^2 + d_a^2)$. Taking the square
root of both sides of equation (5.19) and extending it to include any number
of components, we have


$$\sigma_t = \frac{\sqrt{n_a(\sigma_a^2 + d_a^2) + n_b(\sigma_b^2 + d_b^2) + \cdots + n_k(\sigma_k^2 + d_k^2)}}{\sqrt{N}} \qquad \text{(Standard deviation of combined distributions)} \qquad (5.20)$$


in which $n_k(\sigma_k^2 + d_k^2)$ stands for the last of k samples.

If the samples are all of the same size, i.e., if $n_a = n_b = \cdots = n_k$,
the equation reduces to the form


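$$\sigma_t = \sqrt{\frac{1}{k}\left[(\sigma_a^2 + d_a^2) + (\sigma_b^2 + d_b^2) + \cdots + (\sigma_k^2 + d_k^2)\right]} \qquad \text{(Standard deviation in combined distributions of equal size)} \qquad (5.21)$$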


where k = the number of samples.

To return to the example of Table 5.11, $\sigma_a$ = 1.77, $\sigma_b$ = 2.19, as computed
to two decimal places. For the sake of illustration, $\sigma_t$ was also
computed directly from the total distribution and it was found to be 2.10.
Let us see whether we can arrive at the same value by the use of formula
(5.20). The work is outlined in Table 5.12. There we have used the
known means (to two decimal places; at least two places are necessary in
most cases to give d values to more than one significant digit) to find the
d values, and the two standard deviations, as given. The four contributions
to the total sum of squares are given in the last column of Table 5.12.
The total variance is 4.414, as compared with 4.407 when computed
directly, and the standard deviation checks to two decimal places with
that computed directly.




TABLE 5.12—WORKTABLE FOR THE COMPUTATION OF THE STANDARD DEVIATION OF
COMBINED DISTRIBUTIONS

Distribution      n       σ        σ²         nσ²
A                50      1.77    3.1329    156.6450
B               100      2.19    4.7961    479.6100

Distribution      n       d        d²         nd²
A                50     +0.59     .3481     17.4050
B               100     -0.29     .0841      8.4100

                         $\Sigma x_t^2$ = 662.0700
                         $\sigma_t^2$ = 4.414
                         $\sigma_t$ = 2.10
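
The worktable can be reproduced in a few lines. This is a minimal sketch of formula (5.20), taking the n, σ, and d values for samples A and B as read from Table 5.12; the function name `combined_sd` is an illustrative choice, not the author's.

```python
import math

def combined_sd(ns, sds, ds):
    """Formula (5.20): sigma of a composite from each sample's n, sigma, and d."""
    total_n = sum(ns)
    # Each sample contributes n * (sigma^2 + d^2) to the total sum of squares.
    total_sum_sq = sum(n * (s ** 2 + d ** 2) for n, s, d in zip(ns, sds, ds))
    return math.sqrt(total_sum_sq / total_n)

# Values as read from Table 5.12 (samples A and B).
ns = [50, 100]
sds = [1.77, 2.19]
ds = [0.59, -0.29]

sigma_t = combined_sd(ns, sds, ds)
print(round(sigma_t ** 2, 3), round(sigma_t, 2))
```

Because the function accepts lists of any length, the same call covers the extension to k component samples described above.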


It is probably clear that the work represented in Table 5.12 is less than 
that involved in setting up a total distribution and computing a standard 
deviation from it. It should be stated by way of warning that distribu- 
tions should not ordinarily be combined if either means or standard devi- 
ations differ too much. The answer to the question, “How much is ‘too 
much’?” cannot be given at this stage of the presentation of statistics, 
for the answer must depend upon standard errors of means and of stand- 
ard deviations and upon tests of significance of differences, topics which 
appear in Ch. 9. 

Standard Deviation for Augmented Values.—It should be helpful to the 
beginner in statistics to consider what happens to a standard deviation 
when we make certain systematic changes in measurements. We will con- 
sider two different changes: (1) adding a constant to each score in the 
sample and (2) multiplying each score by a constant. In both instances 
we can predict exactly what will happen. An illustration will be given to 
show what happens. Mathematical proofs are given in Appendix A. 

Let us use a simple problem of five scores, as in Table 5.13. The same 
effects could be shown with any five scores we were to choose. The five 
scores are 11, 9, 8, 7, and 5. Their mean is 8.0 and their standard devi- 
ation is 2.0. First, let us add the constant 5 to every score. The scores 
so augmented are listed in column (4) and are denoted as X’. The mean 
of the X’ scores is 13.0, which is 5 units higher than the mean of the X 
scores. This illustrates the fact that if we add a constant to all scores in a 
sample the mean is increased by the same constant. In terms of an equation, 


$$M_{(X+C)} = M_x + C \qquad \text{(Mean of X values each plus a constant C)} \qquad (5.22)$$


TABLE 5.13.—THE STANDARD DEVIATION WHEN SCORES ARE AUGMENTED BY ADDING A
CONSTANT OR BY MULTIPLYING BY A CONSTANT

                      (1)    (2)    (3)     (4)     (5)    (6)    (7)    (8)    (9)
                       X      x     x²      X’      x’     x’²    X”     x”     x”²
                                          (X + 5)                (3X)

                       11    +3      9      16      +3      9     33     +9     81
                        9    +1      1      14      +1      1     27     +3      9
                        8     0      0      13       0      0     24      0      0
                        7    -1      1      12      -1      1     21     -3      9
                        5    -3      9      10      -3      9     15     -9     81
Sums                   40           20      65             20    120           180
Means                 8.0          4.0    13.0            4.0   24.0          36.0
Standard deviations                2.0                    2.0                  6.0

Notice the deviations x’ [column (5)]. They are identical with the
deviations x [column (2)]. Augmenting each value by adding a constant
has not changed the deviations at all. Nor has the sum of squares
changed, nor the variance, nor the standard deviation, which is still 2.0.
We could have changed every X by deducting a constant from it (which is
equivalent to augmenting it with a negative amount) and the same result
would follow. When each value in a sample is increased by a constant
increment, there is no change in the standard deviation. In terms of an
equation,

$$\sigma_{(X+C)} = \sigma_x \qquad \text{(Standard deviation of X values each plus a constant C)} \qquad (5.23)$$


Let us next multiply each X by 3. The results are given in column (7) 
under the heading X”. The mean of the X” values is 24.0, just three 
times the mean of the X values. The general principle is that when all 
measurements are multiplied by a constant, the mean is also multiplied by 


the same constant. In terms of an equation, 
$$M_{CX} = CM_x \qquad \text{(Mean of X values each multiplied by a constant C)} \qquad (5.24)$$


What happens to the deviations from the mean under this circumstance? 
The deviations x” [column (8)] are 3 times the corresponding x deviations.
The squared deviations x”² are 9 times the corresponding x² values. The sum of
squares is also 9 times $\Sigma x^2$. The variance is 36, which is 9 times the
variance in the original distribution. The standard deviation is 6.0,
which is 3 times that for the original distribution. The general principle
is that when all measurements in a sample are multiplied by a constant the
standard deviation is also multiplied by the same constant. This is also true
when the constant is a fraction, i.e., some value less than 1. The equation
that describes this is

$$\sigma_{CX} = C\sigma_x \qquad \text{(Standard deviation of X values each multiplied by a constant C)} \qquad (5.25)$$




These principles kept in mind will make more meaningful many things 
that follow later. They also explain what happens in the “short” 
methods of computing a mean and a standard deviation. In choosing a 
guessed mean, we virtually deduct the value of that guessed mean from 
every measurement in the sample. This operation has an effect upon the 
computed mean but not upon the computed standard deviation. In 
assigning class values (x’) to the measurements, we virtually divide by a 
constant (i), which is equivalent to multiplying by 1/i. The various 
corrections we make in applying formulas (4.4) and (5.10) bring the mean 
and the standard deviation back to the right levels consistent with the 
original measurements. 
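
The two principles can be checked on the five scores of Table 5.13. The sketch below assumes Python's standard `statistics` module; `pstdev` is used because it divides the sum of squares by N, as the text does, rather than by N - 1.

```python
from statistics import mean, pstdev

scores = [11, 9, 8, 7, 5]           # the X column of Table 5.13

added = [x + 5 for x in scores]     # X' = X + 5
scaled = [3 * x for x in scores]    # X'' = 3X

for label, values in (("X", scores), ("X + 5", added), ("3X", scaled)):
    # Print the mean and the standard deviation (sum of squares over N).
    print(label, mean(values), pstdev(values))
```

Adding the constant shifts the mean and leaves the standard deviation alone; multiplying by the constant multiplies both, in agreement with formulas (5.22) to (5.25).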


DESCRIPTIVE USE OF STATISTICS 


Thus far, the chief uses proposed for measures of central tendency and 
of dispersion have been as simple values descriptive of total distributions. 
This is best appreciated when we compare different samples. As an illus- 
tration of this, see Table 5.14, in which we have a few samples of Army 
General Classification Test data, each based upon a different civilian 
occupational group. We will not concern ourselves at the moment with 
the question of how adequate these particular samples are either for size 
or for representativeness of the populations from which they are purported 
to come. These considerations are, of course, important if we want to 
generalize our conclusions to those populations. We can still compare 
samples as such. 

TABLE 5.14—STATISTICS DESCRIBING DISTRIBUTIONS OF SCORES FOR SELECTED
OCCUPATIONAL GROUPS WHO TOOK THE ARMY GENERAL CLASSIFICATION TEST
DURING WORLD WAR II*

Occupation         N        M       Mdn       σ      Range
Accountant        172     128.1    128.1    11.7    94-157
Lawyer             94     127.6    126.8    10.9    96-157
Reporter           45     124.5    125.7    11.7   100-157
Sales clerk       492     109.2    110.4    16.3    42-149
Plumber           128     102.7    104.8    16.0    56-139
Truck driver      817      96.2     97.8    19.7    16-149
Farm hand         817      91.4     94.0    20.7    24-141
Teamster           77      87.7     89.0    19.6    45-145

* From Harrell, T. W., and Harrell, M. S., Army General Classification Test scores for civilian
occupations. Educ. & Psychol. Meas., 1945, 5, 229-240. By permission of the publisher.

Some general conclusions can be drawn from the inspection of Table
5.14. When the means and medians are placed in rank order, it will be
seen that the occupational groups fall into an approximate rank order for
socioeconomic level. It is also apparent, as should have been expected,
that occupations requiring more “headwork” are highest in the list. The
test emphasized verbal, reasoning, and numerical facilities.

The importance of having both means and medians lies in the infor- 
mation they give concerning skewness. For the lower occupational 
groups, particularly, the medians are slightly higher than the means. 
This indicates slight negative skewing. This is a somewhat surprising 
result, for one would expect that the higher the mean the greater the 
negative skewing, and the lower the mean the greater the positive skewing. 
When a test of moderate difficulty is administered to a group of low aver- 
age ability, scores tend to bunch at the lower end of the scale (positive 
skewing). When the same test is given to a group of high average ability, 
the bunching is expected near the upper end of the scale (negative skew- 
ing). Since in the data of Table 5.14 the skewing seems to be negative 
for most occupational groups and most marked for those of low average 
ability, some explanation is demanded. We can only speculate, which 
means we can suggest several hypotheses which would need further investigation
in order to evaluate their worth. One hypothesis might be that in
any occupational group, particularly among those of lower ability in the
test, a minority of the examinees were very poorly motivated or took the
test under adverse conditions so that they did not perform up to their
characteristic level.
Two indices of dispersion are given: the standard deviation and the
total range. Each tells its own story. Standard deviations are more
meaningful here if it is remembered that for the total range of scores, all
occupational groups combined, the standard deviation was approximately
20.0. The scaling which was utilized aimed at a standard deviation of
20.0 and a mean of 100. The mean in some forms of the test turned out
to be somewhat above 100. We would [expect the dispersion within any one
occupational group to be smaller] than the dispersions for all occupations
[combined]. With three exceptions in Table 5.14, this is true. On the
whole, the higher the occupational group and the higher the mean, the
smaller the dispersion. The higher groups should not be expected to
scatter so far from the mean, because the mean score approaches the
highest scores made by individuals in any group. We might expect a
similar curtailment for groups with lowest means. But a study of the
[ranges shows that] this did not occur.

[The ranges] as such are surprisingly large for all groups. It is hard to
[imagine men] in the professional groups with scores below the
[general] average, unless those scores were low because of poor motivation
or because of advancing age, which is associated with slower rate of work.
The test was a speed test. The lowest scores for the lower occupational 
groups are in line with expectations, but the maximum scores in those 
same groups are illuminating. Many a clerk or truck driver could evi- 
dently have successfully undertaken training for one of the professional 
occupations. In their prewar assignments they for some reason did not 
take full vocational advantage of their abilities. It is this fact and also 
the fact that men of very low academic abilities can engage successfully 
in the occupations like farm hand and teamster that are largely responsible 
for the unusually wide dispersions of scores in such occupational groups. 

In this discussion we are not particularly interested in settling points 
concerning the relation of mental abilities to occupational level or success.
The data were presented here merely as an illustration of the kind of 
inferences one may draw from a set of statistics and the hypotheses that 
may be set up for further investigation, possibly of a very fruitful nature. 
Such inferences and hypotheses would be impossible to make without this 
kind of inspection and the inspection is made possible by having the 
statistical information. 


USES AND INTERRELATIONSHIPS OF DIFFERENT MEASURES 
OF DISPERSION


Choice of the Statistic to Use.—Several considerations come into the 
picture when we decide what measure of variability to employ in any 
situation. One is the reliability of the statistic; its relative constancy in 
repeated samples. In this respect, the statistics come in the order, from 
most reliable to least reliable: standard deviation, average deviation, 
semi-interquartile range, and total range. So far as quickness and ease 
of computation are concerned, the four are almost in reverse order to that 
just given. If further statistical computation is to be given the data, 
such as estimating reliability of the mean and of differences between 
means, computing coefficients of correlation, regression equations, and the 
like, then the standard deviation is by all odds the one to employ. 

As between standard deviation and average deviation, there is some- 
times a choice. The standard deviation, because it derives from squared 
deviations, gives relatively more weight to extreme deviations from the 
mean. If a distribution should have an unusual number of extreme cases
in one or both directions from the mean, some investigators prefer the 
average deviation to the standard deviation. This rule includes cases of 
markedly skewed distributions. 

The semi-interquartile range gives even less importance to extreme 
deviations than does the average deviation and would sometimes be 




given preference to both standard and average deviations for this reason. 
It gives more importance to the central mass of cases. When the median 
is the measure of central tendency adopted, Q should naturally be the 
companion measure of variability. Both are based upon the same prin- 
ciples. When distributions are truncated, or have some indeterminate 
values, only Q can justifiably be used to indicate variability. 


To recapitulate,

1. Use the range when
   a. The quickest possible index of dispersion is wanted.
   b. Information is wanted concerning extreme scores.
2. Use the semi-interquartile range, Q, when
   a. The median is the only statistic of central tendency reported.
   b. The distribution is truncated or incomplete at either end.
   c. There are a few very extreme scores or there is an extreme skewing.
   d. We want to know the actual score limits of the middle 50 per cent
      of the cases.
3. Use the average deviation when
   a. There are extreme deviations which, when squared, would bias
      estimation of the standard deviation.
   b. A fairly reliable index of dispersion is wanted without the extra
      labor of computing a standard deviation.
   c. The distribution is nearly normal and we can therefore estimate
      σ from the AD (see formula 5.28).
4. Use the standard deviation when
   a. Greatest dependability of the value is wanted.
   b. Further computations that depend upon it are likely to be needed.
   c. Interpretations related to the normal distribution curve are
      desired. It will be found in a later chapter that the standard
      deviation has a number of useful relationships to the normal curve
      and to other statistical ideas.
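
For ungrouped scores the four indices can be computed side by side. The sketch below is illustrative only: the function name `dispersion_summary` is hypothetical, and the quartiles come from the default convention of Python's `statistics.quantiles`, which is an assumption and will not always agree with quartiles interpolated from a grouped frequency distribution as done earlier in the chapter.

```python
from statistics import mean, pstdev, quantiles

def dispersion_summary(scores):
    """Return the total range, Q, AD, and sigma for ungrouped scores."""
    m = mean(scores)
    q1, _, q3 = quantiles(scores, n=4)   # one of several quartile conventions
    return {
        "range": max(scores) - min(scores),
        "Q": (q3 - q1) / 2,                                    # semi-interquartile range
        "AD": sum(abs(x - m) for x in scores) / len(scores),   # average deviation
        "SD": pstdev(scores),                                  # sum of squares over N
    }

summary = dispersion_summary([11, 9, 8, 7, 5])
print(summary)
```

The example reuses the five scores of Table 5.13, so the standard deviation can be compared with the value of 2.0 given there.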


Relationships among the Measures of Dispersion.—Previously, the 
standard deviation was related roughly to the range of measurements in a 
sample. In the general run of samples one meets in statistical work, the 
range varies from 4 to 6 times the standard deviation (see Table 5.8), 
depending upon the size of sample. If the distribution with which we 
deal is normal, or nearly normal, in form, we can use a number of other 


relationships. In a strictly normal distribution the following relationships hold:


$$
\begin{aligned}
Q &= .845\,AD = .6745\,\sigma &\qquad (5.26)\\
AD &= 1.183\,Q = .798\,\sigma &\qquad (5.27)\\
\sigma &= 1.483\,Q = 1.253\,AD &\qquad (5.28)
\end{aligned}
$$

(Conversion of one measure of dispersion into another, assuming a normal distribution)


These equations are most useful for checking purposes when for some 
reason we have computed two or more of the statistics. They are also 
useful in estimating one measure of dispersion from another when we do 
not take the trouble to compute more than one. This should be done 
only with great caution, however, being assured both that the dis- 
tribution is close to normal and that the one computed statistic is correct. 
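
The constants in formulas (5.26) to (5.28) can be recovered from the normal curve itself. The sketch below rests on two facts not derived in the text: in a normal distribution Q is the distance from the median to the third quartile, and the average deviation equals σ times the square root of 2/π. It uses `NormalDist` from Python's standard library.

```python
import math
from statistics import NormalDist

sigma = 1.0
unit_normal = NormalDist(mu=0.0, sigma=sigma)

q = unit_normal.inv_cdf(0.75)          # Q: distance from the median to Q3
ad = sigma * math.sqrt(2 / math.pi)    # average (mean absolute) deviation

print(f"Q     = {q / ad:.3f} AD = {q / sigma:.4f} sigma")
print(f"AD    = {ad / q:.3f} Q  = {ad / sigma:.3f} sigma")
print(f"sigma = {sigma / q:.3f} Q  = {sigma / ad:.3f} AD")
```

The printed ratios should match the coefficients in the three formulas to the number of decimal places given there.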


THE COEFFICIENT OF VARIATION


Absolute versus Relative Variability.—Measures of variability are not
directly comparable unless they are based upon the same scale of measure- 
ment with the same unit. It is even questionable whether one should 
compare absolute variabilities on the same measuring scale when two 
groups have decidedly different means. For example, the variability in 
height of infants might naturally be expected to be less than the vari- 
ability in height of adults. If we are interested in comparing the vari- 
ability in height of infants, as infants, with variability in height of adults, 
as adults, we need to consider infant and adult norms. These norms are 
naturally given in terms of means or medians. We are here concerned 
with relative variability rather than absolute variability. The question is 
more correctly stated by saying, “Is the variability of infants’ heights in 
ratio to their mean as great as the variability of adults’ heights in ratio 
to their mean?” We therefore need to know the ratio of the standard 
deviation to the corresponding mean. It is customary to multiply this 
ratio by 100, which tells us what percentage of the mean the standard 
deviation is. The formula is 

$$CV = \frac{100\sigma}{M} \qquad \text{(Coefficient of variation)} \qquad (5.29)$$

Relative Variability and Weber’s Law.—One important application of 
the coefficient of variation is in the field of psychophysics. If we ask an 
observer to duplicate a 90-mm. line by free-hand drawing 50 times and if 
we then compute the mean and standard deviation of his reproductions, 
we may expect a mean something like 107 mm. and a standard deviation 
of about 5 mm. His coefficient of variation is 4.7; or, in other words, his
variability is 4.7 per cent of his mean. In duplicating a line of 180 mm.
50 times, let us say that his mean is 195 mm. and his standard deviation
is 8 mm. The variability has increased as well as his average. According
to Weber’s law, it should have kept in step with his increase in average 




and the coefficient of variation should consequently be the same. CV is 
now 4.1 per cent, or almost the same as before, but is perhaps lower than 
Weber’s law requires. Results in the past have typically shown that with 
increasing mean, the absolute variability does increase though not so 
rapidly in proportion, so that the relative variability decreases and does 
not remain constant, as according to Weber’s law. We are not concerned
here particularly with the validity of Weber’s law except as it
illustrates the importance of relative variability.
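
The arithmetic of the line-drawing example is short enough to check directly. This is a minimal sketch of formula (5.29); the function name is an illustrative choice, and the means and standard deviations are the ones supposed in the text.

```python
def coefficient_of_variation(sd, mean):
    """Formula (5.29): the standard deviation as a percentage of the mean."""
    return 100 * sd / mean

cv_short = coefficient_of_variation(5, 107)   # reproductions of the 90-mm. line
cv_long = coefficient_of_variation(8, 195)    # reproductions of the 180-mm. line

print(round(cv_short, 1), round(cv_long, 1))
```

The second value is the smaller of the two, which is the decrease in relative variability that the passage describes.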

When Not to Apply the Coefficient of Variation.—One important word 
of caution is necessary concerning the application of CV. It should not 
be applied unless we are rather certain that our measuring scale is one of
equal units and, above all, unless the absolute zero point is taken into account.
These qualifications almost entirely confine us to measuring scales with
physical units, such as linear distances, weights, and time. They rule
out ordinary test and examination scores, even mental-age and IQ units,
and thus materially reduce the areas of application of CV in psychological
investigations.
To illustrate the seriousness of this, let us note a fictitious but not
unreasonable example. In a certain psychological test composed of items
the mean is 8.5 and the standard deviation is 3.4. The coefficient of
variation would be 340/8.5 = 40.0. The standard deviation is 40 per
cent of the mean. But remember that scores on such tests do not represent
distances from a meaningful or absolute zero point. Let us assume
that an obtained score of zero on this test actually represents an ability
that is 12 units above the genuine zero point; 12 units of the same order
of magnitude of the units within the obtained range of scores. On such
an “absolute” scale, the mean of the scores would be 20.5 rather than 8.5.
The standard deviation would remain the same, 3.4, since we have in effect
merely added 12 points to each person’s score and have not disturbed the
scores’ relative positions. The CV now becomes 340/20.5 = 16.6, or less
than half what it was before, while the absolute variability has remained
the same.

The coefficient of variability has entered into the controversy concerning
the relation of variability to learning. This has been an important
problem because it was maintained that if variability increases as a group
of individuals indulges in a like amount of practice in a skill or habit,
there is support for the hypothesis of hereditary determination of the
abilities underlying the skill. If variability decreases, it is contended
that the result favors the environmentalist’s hypothesis concerning
determination of abilities. Questions have arisen as to whether measures
of variability should be absolute (standard deviation) or relative (coeffi-




reason is that the frequency to be given corresponding to it will be all 
the cases within the class and below it. All those cases fall below the 
exact upper limit of the class. In column (3) are given the ordinary fre- 
quencies and in column (4), the cumulative frequencies. The cumulation 
is started at the bottom of the list in column (3). Below the upper limit 
of the lowest interval (14.5) are 2 cases. Below the upper limit of the 
second interval (19.5) are these 2 plus the 8 in the second interval, giving 
10 as the cumulative frequency. In the third interval, we find 6 cases 
to add onto what we already have, making 16 for the third interval. 
And so it goes, each cumulative frequency being the sum of the preceding 
one and the frequency in the class interval itself. This continues until
the last (top) interval is reached. The last cumulative frequency should
be equal to N (here it is 50); if not, some error has been made.

FIG. 6.1.—A cumulative frequency distribution curve for the ink-blot test.
[Vertical axis: Frequencies, 0 to 50; horizontal axis: Scores, 0 to 60.]

Plotting the Cumulative Distribution.—Figure 6.1 shows the cumulative
frequencies we have just obtained in Table 6.1, plotted against the corre- 
sponding scores (exact upper limits). The plotting here follows much the 
same routine as prescribed in Ch. 3, except that here we never plot the 
histogram form, only the type that connects neighboring dots with straight 
lines. Obviously we do not obtain a polygon but rather an S-shaped 
curve. In order to bring the curve to the base line at the left, we assume 
that a zero frequency comes at the lower limit of the bottom class interval 
(which is the same as the top of the interval just below it). As before, 
the total figure is about 60 to 75 per cent as high as it is wide. 

Determining Quartiles Graphically.—It is of interest to point out here 
the ease with which the quartiles can be graphically determined or read 
off the curve in Fig. 6.1. To find the median ($Q_2$), we first locate the
frequency of 25 (N/2) on the vertical axis. Draw a horizontal line over
to the curve at this level. At the point where it intersects the curve
drop a perpendicular to the base line. Where this cuts the base line,
read the score value. On ordinary graph paper, $Q_2$ can be read accurately
to one decimal place. $Q_1$ would be similarly determined at the
level of 12.5 on the frequency scale and $Q_3$, at the level of 37.5.

Distribution of Cumulative Percentages and Proportions.—Previously 
we have had reason to transform frequencies into percentages for the sake 
of comparing two distributions where N differs (Ch. 3). The same reason, 
plus more important ones, prompts us more frequently to transform cumulative
frequencies into percentages. In Table 6.2, another example of
cumulative frequencies is given. They are obtained here [column (4)]
just as before. We now wish to find what percentage of 86 each cumulative
frequency is. The arithmetic is simply a matter of multiplying
each cumulative frequency by 100/N. This fraction, 100/86, is equal to
1.1628. It is well here to keep a liberal number of decimal places. In
Table 6.2, the cumulative percentages in column (5) are obtained by
multiplying each frequency in column (4) by 1.1628. These need not be
given to more than one decimal place. Sometimes it is preferable to
work in terms of cumulative proportions, which are given in column (6).
Whereas with percentages the base is 100, with proportions the base is
1.00. Each proportion is therefore simply 1/100 of the corresponding
percentage. Thus, cp = .011628 × cf. The reason for using proportions
will be explained later; here we shall be concerned with percentages.

TABLE 6.2—CUMULATIVE FREQUENCIES, PERCENTAGES, AND PROPORTIONS FOR
MEMORY-TEST SCORES

  (1)       (2)     (3)     (4)        (5)           (6)
 Scores      X       f      cf     Cumulative        cp
                                     %, cP

 41-43     43.5      1      86       100.0         1.000
 38-40     40.5      4      85        98.8          .988
 35-37     37.5      5      81        94.2          .942
 32-34     34.5      8      76        88.4          .884
 29-31     31.5     14      68        79.1          .791
 26-28     28.5     17      54        62.8          .628
 23-25     25.5      9      37        43.0          .430
 20-22     22.5     13      28        32.6          .326
 17-19     19.5      8      15        17.4          .174
 14-16     16.5      3       7         8.1          .081
 11-13     13.5      4       4         4.7          .047
  8-10     10.5      0       0         0.0          .000
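
Columns (4) to (6) of Table 6.2 follow mechanically from the frequencies in column (3). The sketch below, using `itertools.accumulate` from Python's standard library, is one way to carry out the cumulation; the variable names are illustrative.

```python
from itertools import accumulate

# Exact upper limits and frequencies from Table 6.2, lowest interval (8-10) first.
upper_limits = [10.5, 13.5, 16.5, 19.5, 22.5, 25.5, 28.5, 31.5, 34.5, 37.5, 40.5, 43.5]
freqs = [0, 4, 3, 8, 13, 9, 17, 14, 8, 5, 4, 1]

cum_freqs = list(accumulate(freqs))    # running totals from the bottom upward
n = cum_freqs[-1]                      # the last cumulative frequency must equal N

cum_pcts = [round(cf * 100 / n, 1) for cf in cum_freqs]
cum_props = [round(cf / n, 3) for cf in cum_freqs]

for row in zip(upper_limits, freqs, cum_freqs, cum_pcts, cum_props):
    print(*row)
```

The rows print from the lowest interval upward, the reverse of the order in the table, since that is the direction in which the cumulation proceeds.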
The Cumulative Percentage Curve, or Ogive.—In Fig. 6.2, the cumu- 
lative percentages we have just obtained in Table 6.2 are plotted as points 
against the corresponding score points (exact upper limits of class inter- 
vals). Again, an S-shaped curve results. Now that it is standardized 
as to height, it is sometimes called an ogive.¹ The ogive is, in other words,
the cumulative percentage distribution curve. Two ogives are much more
readily compared than two ordinary cumulative curves because of their
common height. But this is not the only use of an ogive, as we shall
soon see.


CENTILE NORMS 


Finding Centile Points by Interpolation.—A centile point, or centile, is a 
value on the scoring scale below which are any given percentage of the cases.²
For example, the 90th centile is the point below which are 90 per cent
of the scores, and the 24th centile is the point below which are 24 per
cent of the scores.³

Deciles and Tenths.—We have already seen how to interpolate in order 
to compute a median and other quartiles. Actually, the median is at the 
50th centile, $Q_1$ is at the 25th centile, and $Q_3$ is at the 75th centile. It is
but a step further to generalize this to any centile one desires. We could 
choose to interpolate any centile; the 63d, the 81st, or the 8th. Our 
interest in testing happens to stress the centiles that are multiples of 10— 
the 90th, 80th, 70th, etc., down to the 10th. These are called the deciles, 
for they divide the distribution into tenths, just as the quartiles divide it 
into quarters and the median, into halves. 

1 The ogive may also be in terms of cumulative proportions, since proportions and
percentages are used interchangeably.

2 The term centile is often called (superfluously) percentile in the literature. There is
about as much excuse for speaking of perdecile or of perquartile.

3 The term centile, without reference to a scale of measurement, should mean centile
rank. Thus, to say that an individual is at the 24th centile indicates his rank among a
hundred persons. Being better than 24 of a hundred, he would rank 25th from the
bottom.

The Process of Interpolation.—The principle of interpolating is not new.
Table 6.3 shows how we may work out the deciles systematically. The
complete headings of the table make the work almost self-explanatory,
but let us follow through one or two examples. First we need to know
how many cases out of the total of 86 we need to include in any given
percentage. Ninety per cent of 86 is 77.4, which we find in column (2).
We must count up the scoring scale among the frequencies until we
include 77.4 cases. Reference to Table 6.2 shows that we get by accumulation
76 cases up to the score point 34.5. We need 1.4 more cases
among the 5 in the next higher interval. There are 3 score units in the
interval; so we have to proceed 1.4/5 times 3, or, as given in columns (4)
and (5) of Table 6.3, we add to 34.5 the amount $\frac{1.4 \times 3}{5}$, which gives
35.3 as the centile point. We say that $P_{90}$ (90th centile) equals 35.3.

As another illustration, let us solve for $P_{10}$. Ten per cent of 86 is 8.6.
Counting up to the score point of 16.5, we find 7 cases, which leaves us
needing 1.6 more out of the 8 in the next interval. $P_{10}$ is therefore equal
to $16.5 + \frac{1.6 \times 3}{8}$, which equals 17.1. The remaining centiles are similarly
determined and are listed in the last column of Table 6.3.

TABLE 6.3—CALCULATION OF CENTILES, OR CENTILE POINTS BY INTERPOLATION IN
THE MEMORY-TEST DATA

    (1)           (2)            (3)               (4)               (5)             (6)
 Percentage    Number of     Cumulative        Exact lower       Distance of     The centile
 below the     cases below   frequency         limit of          centile point   point
 centile       the centile   actually below    interval          above lower
 point         point         the interval      containing the    limit
                             containing the    centile point
                             centile point

    90           77.4            76               34.5           (1.4 × 3)/5        35.3
    80           68.8            68               31.5           (.8 × 3)/8         31.8
    70           60.2            54               28.5           (6.2 × 3)/14       29.8
    60           51.6            37               25.5           (14.6 × 3)/17      28.1
    50           43.0            37               25.5           (6 × 3)/17         26.6
    40           34.4            28               22.5           (6.4 × 3)/9        24.6
    30           25.8            15               19.5           (10.8 × 3)/13      22.0
    20           17.2            15               19.5           (2.2 × 3)/13       20.0
    10            8.6             7               16.5           (1.6 × 3)/8        17.1
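
The interpolation described above can be written as one small routine. The following is a sketch under stated assumptions, not the author's own procedure: the function name `centile_point` is hypothetical, the intervals are taken to be of equal width, and the frequencies are those of Table 6.2 listed from the lowest interval upward.

```python
def centile_point(pct, upper_limits, freqs, width):
    """Interpolate the score below which `pct` per cent of the cases fall."""
    n = sum(freqs)
    target = pct * n / 100           # number of cases below the centile point
    below = 0                        # cumulative frequency below the current interval
    for upper, f in zip(upper_limits, freqs):
        if f > 0 and below + f >= target:
            lower = upper - width    # exact lower limit of this interval
            return lower + (target - below) * width / f
        below += f
    raise ValueError("pct must lie between 0 and 100")

upper_limits = [10.5, 13.5, 16.5, 19.5, 22.5, 25.5, 28.5, 31.5, 34.5, 37.5, 40.5, 43.5]
freqs = [0, 4, 3, 8, 13, 9, 17, 14, 8, 5, 4, 1]

deciles = {p: round(centile_point(p, upper_limits, freqs, 3), 1)
           for p in range(90, 0, -10)}
print(deciles)
```

Any other centile, such as the 63d or the 8th, can be obtained by passing that percentage in place of a multiple of 10.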

The Utility of Centile Norms.—Test scores of various kinds are frequently
interpreted in terms of centile norms, for very good reasons. In
the first place, a raw score of so many points means very little to us. Tell
a student’s adviser that his advisee made a score of 59 points in an algebra- 
achievement examination, 175 points in an English-achievement exami- 
nation, and 121 points in a general scholastic-aptitude test, and without 
further information the adviser does not know whether his advisee is low 
in all tests, high in all tests, or low in one or two and high in the remaining. 
But tell him that a score of 59 points in algebra is at the 99th centile, the 
175 points in English is at the 32d centile, and the 121 in scholastic apti- 
tude is at the 48th centile, when those centiles were established by the 
scores from 1,500 freshmen entering the University with the advisee in 
question; then he will have some usable information. The student in 
question is extremely high in algebra, moderately low in English, and 
about average in general scholastic aptitude. The chief utility of centile 
norms is (1) to give some conception of the general level of a score in a 
known population, and (2) to put scores from different tests on a com- 
parable basis. 

Finding Centile Norms by Interpolation.—If we wished to have a table 
of centile norms for the memory test, we could now use the nine decile 
points already found by interpolation as they are listed in the last column 
of Table 6.3. Then when a student came along with a score of 22 we 
could say that he is at the 30th centile; another student with a score of 
30 is at the 70th centile, etc. When a score came up that is not exactly 
listed we could find its centile equivalent by interpolation. For example, 
a score of 21 would be at the 25th centile, and a score of 27 would be at 
about the 53d centile. 
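
The same interpolation can be sketched in Python. The sketch is only illustrative: it recomputes the nine decile points from the grouped memory-test frequencies (taken here from column (7) of Table 7.1) instead of copying them from Table 6.3, rounds them to one decimal, and then interpolates linearly between the two decile points that bracket a score. The names `centile_point`, `DECILES`, and `centile_rank` are assumptions introduced for the example.

```python
# Centile rank of a raw score by linear interpolation between decile points.
FREQUENCIES = [4, 3, 8, 13, 9, 17, 14, 8, 5, 4, 1]   # memory test, N = 86
LOWEST_LIMIT, INTERVAL_SIZE = 10.5, 3


def centile_point(centile):
    """Interpolated score point for a given centile (grouped data)."""
    target = centile / 100 * sum(FREQUENCIES)
    cumulative = 0
    for k, f in enumerate(FREQUENCIES):
        if cumulative + f >= target:
            lower_limit = LOWEST_LIMIT + k * INTERVAL_SIZE
            return lower_limit + (target - cumulative) / f * INTERVAL_SIZE
        cumulative += f
    raise ValueError("centile must not exceed 100")


# The nine decile points, rounded to one decimal as in a table of norms.
DECILES = [(c, round(centile_point(c), 1)) for c in range(10, 100, 10)]


def centile_rank(score):
    """Interpolate linearly between the two decile points that bracket `score`."""
    for (c_low, p_low), (c_high, p_high) in zip(DECILES, DECILES[1:]):
        if p_low <= score <= p_high:
            return c_low + (c_high - c_low) * (score - p_low) / (p_high - p_low)
    raise ValueError("score lies outside the range of the decile points")


if __name__ == "__main__":
    for score in (21, 22, 27):
        print(score, round(centile_rank(score)))
```

Scores below the 10th or above the 90th centile point are rejected in this sketch, since the decile points alone give nothing to interpolate between at the extremes.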

Centile Norms from Smoothed Ogives.—But there are objections to 
the use of interpolated centiles as norms. Chance irregularities in dis- 
tribution from a small sample often give a distorted picture of the true 
situation that probably obtains in the larger population. After all, it is 
the larger population that we wish to represent in our norms, or at least 
we should like to compare future individuals’ scores with something more 
stable and general than our limited sample. For this reason the author 
strongly recommends that centile norms be set up in terms of the smoothed 
ogive. Interpolated norms are derived from the unsmoothed curve and, 
as was said, they are affected by minor irregularities that are probably a 
peculiarity of this sample only and not of the general population. The
smoothed ogive may be taken as an estimation of the distribution of the
general population of which our group is a sample. When a sample is
large, very little smoothing is necessary. Even with small samples, at
times surprisingly little smoothing need be done.

In Fig. 6.2, a smoothed ogive (by inspection and free-hand drawing) 
has been drawn. The aim is to bring it as close as possible to all points 
and if points must be untouched by the curve, there should be about as
many below the curve as above it. If too glaring discrepancies occur
between points and curve after smoothing, it is probably best to discard 
the attempt to use these data as a basis for norms or else to add more 
cases until sampling irregularities are greatly reduced. 


Fig. 6.2.—Smoothed cumulative distribution curve for the memory-test scores.
Frequencies are in terms of percentages. [Horizontal axis: scores in a
memory test, 10 to 45. Vertical axis: cumulative percentages, 0 to 100.]

Reading Centile Scores from a Graph.—Having satisfied oneself as to the
smoothed ogive, the next step is to read off the diagram the score points
corresponding to the centile ranks for which norms are required. For
this purpose the diagram should be enlarged sufficiently for easy reading
and the graph paper finely ruled so that score points may be accurately
read to one decimal place. In Table 6.4 are given the score points
corresponding to centiles 10 to 90, as before, but also to 95 and 99 at the
upper end and to 5 and 1 at the lower end. The reason for including
these extra points at the extremes is that there is actually a great range
of ability above the 90th centile and also below the 10th centile. In fact,
the range of ability is about as great beyond the 90th centile as it is
between the mean and the 90th centile, and as great below the 10th centile
as between that point and the mean, when the distribution is normal.

TABLE 6.4.—CENTILE NORMS FOR THE MEMORY TEST, DERIVED FROM THE SMOOTHED
OGIVE

Centile | Score point | Integral score
   99   |    40.5     |      41
   95   |    37.1     |      38
   90   |    34.9     |      35
   80   |    31.8     |      32
   70   |    29.5     |      30
   60   |    27.9     |      28
   50   |    26.1     |      27
   40   |    24.3     |      25
   30   |    22.5     |      23
   20   |    20.4     |      21
   10   |    17.5     |      18
    5   |    14.9     |      15
    1   |    11.9     |      12

A Defect in Decile Scales.—One defect of the centile scale, as a measuring
scale, is that it exaggerates individual differences, relatively, near the
center of the distribution as compared with those near the ends. Giving
score norms corresponding to selected centiles beyond 10 and 90 compensates
for this defect to a large extent. Because of this same defect, it is
not the best practice to work with decile norms, for to do so often leads
the user of the norms to lay too much stress upon differences among the
great average group and too little upon those where tests discriminate best.


Fig. 6.3.—Showing how, when a distribution is converted to one with decile
units on the base line, a distribution that was unimodal, and perhaps
normal, becomes rectangular. The areas of the two distributions are
approximately equal. [Curves shown: distribution based upon scores
(unimodal); distribution based upon deciles (rectangular). Base line:
scales of ability (scores, also decile ranks).]


Figure 6.3 illustrates how a decile scale distorts differences along the
scale. This figure is so drawn that the 10 decile divisions cover the same
total range as the original scores. The heights of the rectangles are drawn
so that the total area in the 10 categories combined is equal to that under
the original curve. The new frequency distribution, when decile ranks
are given equal distances on the measurement scale, is rectangular. It is
as if we had pressed down upon the center of the original distribution,
forcing the central individuals farther apart, and to make up for it, we
group individuals who are spread over the tails of the original curve into
narrower categories.

Another illustration of the distorting effect of decile and centile scales 
when we give equal distances to numerically equal intervals is shown in 
Fig. 6.6. Here are shown parallel scales for the memory test. Corre- 
sponding centile ranks and raw scores are connected by dotted lines. 
From this it will be seen, in another way, how raw-score distances near 
the center become relatively spread and how equal distances near the 
extremes are relatively condensed when converted to centile-rank values. 

It is probably best that decile norms, as such, be consigned to the 
limbo of forgotten procedures. In their place the author recommends 
the use of a C scale, which will be described in a later chapter (Ch. 12). 
Centile norms will continue to be useful, but it is urged that they be 
constructed in a way that will give more correct impressions of scale 
positions, as will now be described. 

Integral Centile Points.—Before doing that, however, a further word of
explanation of Table 6.4 is in order. The last column of “integral scores”
is merely a revision of the second column by way of rounding to whole
numbers. Tables of norms are frequently given in terms of whole numbers,
mainly because scores are obtained as whole numbers. We should
say that an obtained score of 41 is better than 99 per cent of the group
can make, and a score of 18 is better than only the lowest 10 per cent
can make. It should be noticed that every fractional score is rounded
upward to the next whole number; thus 37.1 becomes 38. Since an obtained
score of 37 covers a range of 36.5 to 37.5, more than half of those making
this score would not be better than 95 per cent. The first score, counting
from below upward, that is totally better than 95 per cent is a score of 38.
This is why, in this and in other cases in this table, we round upward to
the next higher integer.
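
The rounding rule can be sketched in Python as follows. This is a minimal illustration, not part of the original procedure; the names `SCORE_POINTS` and `integral_score` are assumptions, and the score points are those of the second column of Table 6.4. Note that `math.ceil` leaves a score point that is already a whole number unchanged, a case that does not arise in this table.

```python
import math

# Score points read from the smoothed ogive (Table 6.4), keyed by centile.
SCORE_POINTS = {
    99: 40.5, 95: 37.1, 90: 34.9, 80: 31.8, 70: 29.5, 60: 27.9, 50: 26.1,
    40: 24.3, 30: 22.5, 20: 20.4, 10: 17.5, 5: 14.9, 1: 11.9,
}


def integral_score(score_point):
    """Round a fractional score point upward to the next whole number."""
    return math.ceil(score_point)


# Integral norms, in the same order as the table.
INTEGRAL_NORMS = {c: integral_score(p) for c, p in SCORE_POINTS.items()}

if __name__ == "__main__":
    for centile, score in INTEGRAL_NORMS.items():
        print(f"{centile:>3}  {SCORE_POINTS[centile]:5.1f}  {score}")
```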


A Graphic Profile Chart.—Many profile charts based upon centiles show
graphically the deciles at equidistant levels along the scale. This gives
an erroneous conception of the relative spacing of ability or talent, as
was pointed out in a preceding paragraph. Actual differences in ability
are probably more accurately indicated by the raw-score units than they
are by centile-rank units, which relatively magnify the central portions
of the distribution. If it is assumed that the actual distribution for the
norm group is Gaussian, or normal, in shape, the relative spacing of the
various centiles that we customarily include in our norms should be as
given in Table 6.5. In the first column are the customary centile ranks.
In the second column are the corresponding distances from the mean (and
median) when the standard deviation of the distribution is adopted for
convenience as the unit. The corresponding centile ranks and sigma
distances are also represented in Fig. 6.4. The correspondence of deviation
from the mean with centile rank depends entirely upon the mathematical
relations that hold true for the normal distribution curve, and the reasons
for this need not concern us here. The author merely proposes to use
this spacing of the centile ranks in setting up a profile chart and has done
so in Fig. 6.5. The sigma values are not shown numerically in the chart
and need not be. Once having located them at the proper distances, we
may forget the sigma values.

TABLE 6.5.—THE DISTANCE OF CENTILES FROM THE MEAN IN NUMBER OF STANDARD
DEVIATIONS IN A NORMAL DISTRIBUTION

Centile rank | Number of sigmas from the mean
     99      |   +2.33
     95      |   +1.64
     90      |   +1.28
     80      |   +0.84
     70      |   +0.52
     60      |   +0.25
     50      |    0.00
     40      |   -0.25
     30      |   -0.52
     20      |   -0.84
     10      |   -1.28
      5      |   -1.64
      1      |   -2.33

Fig. 6.4.—Showing on parallel scales standard scores and corresponding
centile ranks. Since standard scores are given equal spacing, centile ranks
have unequal spacing. Had centile ranks been given equal spacing, standard
scores would have had unequal spacing. [Upper scale: centile ranks. Lower
scale: standard-score scale, -2.00 to +2.00.]
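
The sigma distances of Table 6.5 can be checked with the inverse of the cumulative normal distribution. The following Python sketch is only an illustration; it assumes the standard-library class `statistics.NormalDist` (available in Python 3.8 and later), and the function name `sigma_distance` is hypothetical.

```python
from statistics import NormalDist

CENTILE_RANKS = [99, 95, 90, 80, 70, 60, 50, 40, 30, 20, 10, 5, 1]

# Normal distribution with mean 0 and standard deviation 1.
standard_normal = NormalDist(mu=0.0, sigma=1.0)


def sigma_distance(centile_rank):
    """Distance from the mean, in standard deviations, of a centile rank."""
    return standard_normal.inv_cdf(centile_rank / 100)


if __name__ == "__main__":
    for rank in CENTILE_RANKS:
        print(f"{rank:>3}  {sigma_distance(rank):+.2f}")
```

Centile ranks below 50 come out negative and those above 50 positive, symmetrically about the mean, which is the spacing the profile chart is meant to preserve.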

Provision has been made for four tests in the profile chart, the memory
test whose norms we have determined in previous parts of this chapter,
a vocabulary test, a word-building test, and a sentence-construction
test, whose norms were determined elsewhere. For the memory test,
the integral scores have been written in at their corresponding centiles,
being guided by the list of score points in column (2) of Table 6.4.
Once the scores nearest those points are located and written in the
diagram, the other, intervening scores can be introduced. The same was
true for the other test norms, though because of crowding, some integral
scores have been omitted. The student whose profile is shown earned
raw scores of 28, 88, 20, and 23, respectively, in the four tests. Those
four scores have been encircled and then connected with straight lines to
complete the profile. We can now see the general trend of this student’s
ability in these four tests taken together, and we can read off his centile
rating in each test at a glance. Furthermore, a much more accurate
conception of his fluctuation in ability is given than would have been true
in a diagram with equidistant deciles.

Figure 6.6 shows how, if we had spaced the centile ranks at equidistant
intervals, as is sometimes done, the corresponding separations on the
score scale would have been very unequal in different parts of the scale.
As a general principle, individuals are best discriminated by tests where
they are spread thinnest in the distribution.

Fig. 6.5.—An example of a profile chart based upon centile norms. Note that
the centile ranks are not spaced at equidistant intervals but at intervals
based upon corresponding sigma distances from the mean (see Table 6.5 and
Fig. 6.4).

A Bar Diagram of Distributions of Scores.—A useful graphic device for
picturing distributions of scores is shown in Fig. 6.7.¹ The bar diagrams
there illustrate the distributions of three groups of students who were
taught by three different instructors but who were given the same final
examination, an objectively scored achievement examination in English.

¹ Similar diagrams have been used for some time by the Cooperative Test Service.


Fig. 6.6.—Showing parallel scales of centile ranks and corresponding raw
scores on the memory test. Here centile ranks are equally spaced on their
scale and raw scores are equally spaced on their scale. Unequal raw-score
intervals correspond to equal centile-rank intervals. [Upper scale: test
scores, 10 to 40. Lower scale: centile ranks, 0 to 100.]


The median of each group is marked by a short horizontal line through
the bar at the median-score level. The range of the middle 50 per cent
(from $P_{25}$ to $P_{75}$, or from $Q_1$ to $Q_3$) is shown in each case by
the open rectangle. The black bars extend out to the points $P_{10}$ and
$P_{90}$, in other words, to include the middle 80 per cent of the cases.
The lines extend to points at $P_5$ and $P_{95}$, or to include the middle
90 per cent of the cases. The highest and lowest single scores are marked
by the small x’s. Thus several meaningful centile points are labeled, as
well as the entire range.

Fig. 6.7.—A graphic device for visual comparison of distributions, showing
important centile values and total ranges. [Vertical axis: scores in an
English examination. One bar each for classes A, B, and C; highest and
lowest scores marked by x.]
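
The values needed to draw one bar of such a diagram can be gathered with a short Python sketch. It is offered only as an illustration: the function name `bar_diagram_summary` is hypothetical, and the centiles are taken from ungrouped scores with the standard-library function `statistics.quantiles`, whose interpolation rule is not the grouped-data interpolation used earlier in this chapter, so the results may differ slightly from hand computation on class intervals.

```python
from statistics import quantiles


def bar_diagram_summary(scores):
    """Centile values and extremes needed to draw one bar of the diagram."""
    # With n=100, quantiles() returns the 99 cut points P1, P2, ..., P99.
    centiles = quantiles(scores, n=100, method="inclusive")

    def p(k):
        return centiles[k - 1]

    return {
        "lowest": min(scores),      # marked by an x
        "P5": p(5),                 # lower end of the line
        "P10": p(10),               # lower end of the black bar
        "P25": p(25),               # lower end of the open rectangle (Q1)
        "median": p(50),            # short horizontal line
        "P75": p(75),               # upper end of the open rectangle (Q3)
        "P90": p(90),               # upper end of the black bar
        "P95": p(95),               # upper end of the line
        "highest": max(scores),     # marked by an x
    }
```

Calling the function once for each class and plotting the results side by side on a common score axis gives the comparison described above.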




Interpretation of Bar Diagrams.—One important use of bar diagrams is 
the ready comparison of groups that they afford. In Fig. 6.7, for example, 
it is obvious that the three medians come in the order 1, 2, 3 for groups 
C, B, and A, respectively. The variabilities of the three groups come in 
the order B, C, and A when we depend upon total ranges. The groups 
come in almost the same rank order for variability when we compare 
ranges of middle 90 per cent, but again the order B, C, A is probably 
correct in comparing middle 50 per cents, though B and C are very close 
together in this respect. As to topmost scores, they come in the same 
order as for medians C, B, A, but for bottom scores the order is A, C, B. 
As to skewness, the most symmetrical distribution, all things considered, 
is probably that for group B, and the least symmetrical is for group A,
which is positively skewed. The special virtue of this kind of compari- 
son, as contrasted with that afforded by means of frequency polygons 
and ogives, is that many more facts about a distribution can be recorded, 
and yet because of no overlapping of the drawings there is direct com- 
parison without confusions. 

Exercises 


1. Carry through the following steps for the first distribution of chemistry-aptitude 
scores in Data 3C (Ch. 3). 
a. Find the cumulative frequencies, and tabulate them. 
b. Plot a cumulative distribution curve similar to Fig. 6.1. 
c. Find the cumulative percentages and proportions, and tabulate them. 
d. Plot the ogive distribution, showing the smoothed curve. 
e. Compute the interpolated centiles that divide the distribution into tenths. 
f. Derive centile norms from the smoothed ogive, and set up a table of norms. 


g. Prepare a centile profile chart including the norms for this test and for one or
two others for which you have data.
2. Repeat the steps, particularly a, c, d, and f, for any other distribution of test scores.
3. Prepare bar diagrams like those in Fig. 6.7 for comparing two or more distributions,
such as the two in Data 3C, or Data 4F (Ch. 4).


CHAPTER 7 
THE NORMAL DISTRIBUTION CURVE 


Repeatedly have sets of measurements in psychology and education 
yielded frequency distributions that resemble the bell-shaped normal, or 
Gaussian, curve. Because the normal curve has so many useful mathe- 
matical properties, it is quite natural that we should exploit those proper- 
ties in dealing with psychological and educational data. Without the use 
of the Gaussian curve and its convenient characteristics, many things that 
we now do with data would otherwise be impossible. It is important, 
therefore, that the student develop at least a moderate understanding of 
the normal curve in order that he may wisely apply the statistical pro- 
cedures that depend upon it. 

Normality of Distribution Is Assumed.—It must be confessed at the 
outset that no set of data ever obtained, whether they be measurements 
of a group of individuals with respect to some biological, psychological, 
social, or educational trait or whether they be repeated observations of a 
single phenomenon, ever conforms exactly to the normal distribution 
pattern. Even though the larger population from which our sample 
came is perfectly normally distributed (even this is probably never strictly 
true), sampling, no matter how extensive or representative it may be, 
is bound to give us some irregularities, with deviations from the normal 
form. Whenever, therefore, we treat our data as if they were normally 
distributed, or arose from a population that is normally distributed, we 
are assuming an ideal pattern for the sake of simplicity, rationality, and 
convenience. Sometimes we are more justified and sometimes less; we 
can never be absolutely sure, because the entire population is rarely or 
never measured, and the true shape of distribution is never known. 

We can justify our assumption of normality in several ways. One is
the rational approach, which attempts to point out that the phenomenon
we are measuring results from a number of independent causes occurring
in chance combination, as in the tossing of coins or in the combinations
of nonlinked hereditary genes. Very rarely is this kind of argument
possible because of our ignorance of underlying causes. Another kind of
approach is empirical, in which we can show that, with the use of the
measuring scale that we did use, the grouped data present a frequency
distribution that obviously possesses a bell-shaped contour. Furthermore,
there are statistical tests that can be applied to show whether or
not the frequencies we obtained deviate so much from the normal-curve
picture as to cause us to reject our hypothesis that the data came by
random sampling from a normally distributed population.

Two Reasons for Caution.—There are two considerations, however, 
which should cause us to pause before making the hypothesis or assump- 
tion of normality. One has to do with the question of sampling and the 
other with the question of the correctness of our measuring scale. A 
population may well be normally distributed, yet because of our method 
of drawing cases for measurement we may obtain a skewed or otherwise 
distorted form of distribution. This is a case of biased sampling. A 
large population of ten-year-old children would probably be distributed
normally when measured for mental age. But if we confine ourselves to
ten-year-old children in the fourth grade only, where most ten-year-olds 
are probably present because of mental retardation and a few for other 
reasons, the distribution of mental ages would be positively skewed. The 
ten-year-olds in the sixth grade would probably yield a negatively skewed 
distribution, for the majority of them are accelerated by reason of pre- 
cocity and a few for other causes. Both are cases of biased sampling. 
An unbiased, representative sampling would not confine itself to fifth- 
grade children, but would take ten-year-olds in correct ratios from all 
grades where they appear, would take them in correct proportions as to 
sex, economic status, and other factors considered significant. 

When a test or examination is used as the measuring instrument, the 
form of distribution of scores will depend upon many factors other than 
the form of distribution of the population. One of these factors is the 
level of difficulty of the test relative to the level of ability of the popu- 
lation. Even if the population is normally distributed in the ability 
measured, unless the test is of an appropriate level of difficulty a normal 
distribution of scores in a sample will not be obtained. If the test is too
difficult, the distribution will be positively skewed, like that labeled A in
Fig. 7.1. If the test is of moderate difficulty for the group, a symmetrical
distribution like that labeled B will occur. If the test is too easy for the
group examined, the distribution will be negatively skewed, like C in
Fig. 7.1. Other degrees of skewing might occur. The effect of skewing,
when we are sure that the correct form of distribution should be
symmetrical, may be regarded as a systematic distortion of the scale of
measurement. The too difficult test tends to make the numerical units
among the low scores stand for relatively large intervals of ability, and
the too easy test to make the units among the high scores also stand for
relatively large intervals. This principle should be clear from a study of
Fig. 7.1.

Other factors than difficulty may distort sample distributions. Later 
(Ch. 17) it will be shown how degree of reliability of scores may affect 
the form of distribution, causing tendencies toward sharpness of the rise 
in the center versus flatness, tendencies toward bimodality, and even 
U-shaped distributions. Another distorting factor may be the unsuit- 
ability of the scale. As was pointed out in an earlier chapter (Ch. 4), 
work-limit scores and time-limit scores tend to be reciprocals of each 
other. If the one kind of score in a task is normally distributed, the 
other will probably not be. 

These cautions kept in mind should serve to inhibit dogmatic assertions
that might otherwise be made about the shape of a distribution. The
shape of a distribution is always a function of the kind of measuring scale,
and all conclusions that involve form of distribution should take this fact
into account. The conviction that general populations are genuinely
normally distributed with respect to most qualities is very strong,
however; so it is usually the marked deviation from normality in a sample
that arouses questions. We may then question either our method of
sampling or our measuring scale. One or both of these factors may be
responsible for the discrepancy. But when our sample distribution turns
out reasonably normal in appearance, because of the conviction just
mentioned we may feel some assurance that our sampling and our measuring
scale are probably free from distortions, though of course we can never
be certain of this. The conviction does lead us to apply the Gaussian
curve in many useful ways, even in turning crude judgments into scaled
measurements, as we shall see later (Ch. 19). We frequently feel that
the risk in making the normal assumption is well worth while because of
the invaluable results and conclusions it affords. We can always state
our conclusions with the reservation that they are true to the extent that
our assumptions are valid. As a matter of fact, all other conclusions
should be couched in similar terms, for none is without its foundation of
assumptions of one kind or another, whether stated or not. All scientific
conclusions rest on assumptions, in the final analysis, and he who would
know the import of those conclusions best is the one who knows those
assumptions best.

Fig. 7.1.—Showing how a test at three different levels of difficulty
(curves A, B, and C) may yield distributions of raw scores differing
markedly in skewness regardless of the form of distribution of the
population.


THE NATURE OF THE NORMAL CURVE 


The Relation of the Normal Curve to Probability.—The Gaussian curve
is also sometimes called the normal probability curve and is said to be the
result of the “laws of chance.” In a sense, this is true. We cannot here
go into an involved discussion of probability and of the way in which the
Gaussian curve is logically related to probability. It is sufficient for our
present purposes to point out the usual example of how a normal
distribution can be approximated by means of coin tossing. If we thoroughly
shake a set of 6 coins and toss them to land where and how they may, the
result can turn out in seven different ways; the number of heads can vary
all the way from 0 to 6. In a total of 64 tossings, according to the
principles of probability, we should expect the following frequencies for
various numbers of heads:

Heads:                  0    1    2    3    4    5    6
Expected frequencies:   1    6   15   20   15    6    1

If we tossed the 6 coins twice as many times, we should expect these
frequencies to be doubled. Actually obtained frequencies will deviate from
these expected ones by small amounts. In one such experiment with
128 tosses, the obtained frequencies were as given here:

Heads:                  0    1    2    3    4    5    6
Obtained frequencies:   . . .
Expected frequencies:   2   12   30   40   30   12    2

This situation is shown graphically in Fig. 7.2, where the obtained
frequencies furnish the basis for the histogram and the expected frequencies
furnish the basis for the superimposed normal curve.
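
The expected frequencies for the coin problem follow from the binomial coefficients: with 6 fair coins there are $2^6 = 64$ equally likely outcomes, of which $\binom{6}{k}$ show exactly $k$ heads. A minimal Python sketch, in which the function name `expected_head_frequencies` is an assumption introduced for the example, is:

```python
from math import comb


def expected_head_frequencies(coins, tossings):
    """Expected frequency of 0, 1, ..., `coins` heads in `tossings` throws of fair coins."""
    total_outcomes = 2 ** coins
    return [tossings * comb(coins, heads) / total_outcomes
            for heads in range(coins + 1)]


if __name__ == "__main__":
    print(expected_head_frequencies(6, 64))    # one set of 64 tossings
    print(expected_head_frequencies(6, 128))   # twice as many tossings
```

Doubling the number of tossings doubles every expected frequency, as the text states; the shape of the distribution is unchanged.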

A 6-coin problem gives us a 7-sided frequency polygon (not counting
the base line). A 10-coin problem gives us an 11-sided contour, etc., the
number of sides being equal to the number of coins plus 1. If we do not
enlarge the base line of our distribution but keep subdividing it into
smaller and smaller units as we increase the number of coins, the contour
of the distribution curve approaches the smooth bell form. The number
of class intervals we choose in grouping obtained measurements has
nothing to do with the number of coins, our choice being entirely arbitrary.
The class intervals and their frequencies merely give us descriptions of
the contour at points along the way. If there are things like coins in the
phenomenon we are measuring (i.e., “coins” such as genes, which may be
present or absent, or such as responses that do or do not occur) we almost
always lack information as to how many such “coins” are operating.
Probably there are a great many, although even if there were only 6,
as in the coin example, and if our measurements naturally fell therefore
into seven class intervals, the normal distribution could still be roughly
approached, as can be seen in Fig. 7.2.

Fig. 7.2.—A distribution curve representing the frequencies with which
various numbers of heads are expected by chance in tossing six coins; also,
in histogram form, the obtained distribution from 128 tossings.
[Horizontal axis: heads, 0 to 6. Vertical axis: frequencies, 0 to 50.]


The Equation for the Normal Curve.—Mathematically, when we are
dealing with the properties of the normal curve, it is the situation with
an infinite number of “coins” that we suppose. This enables the
mathematician to give to the curve an equation that describes the relationship
of a frequency to its corresponding measurement. This equation reads

$$Y = \frac{N}{\sigma\sqrt{2\pi}}\, e^{-x^{2}/2\sigma^{2}} \quad \text{(Equation for the Gaussian or normal curve)} \tag{7.1}$$

where Y = frequency.
N = number of measurements.
σ = standard deviation of the distribution.
π = 3.1416.
e = 2.718 (the base of the Naperian system of logarithms).
x = deviation of a measurement from the mean (or X - M).

Since the values for π and e are known, if we substitute them in the
equation, it becomes

$$Y = \frac{N}{2.5066\,\sigma}\,(2.718)^{-x^{2}/2\sigma^{2}}$$

For any distribution we may have at hand, we know the values for N
and for σ, and these can be inserted in their places in the equation. The
equation would then be in a form with only Y and x the unknowns. We
could then assign certain values to x, within the range of our
measurements, and then solve the equation for the corresponding values of Y.
In this way, we could determine the entire normal distribution curve that
best fits our data. The arithmetical work would be rather laborious.
Fortunately, we have the use of statistical tables to aid us in this.
Table B, in Appendix B, is one well suited to this purpose.
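
Today the laborious arithmetic can be left to a few lines of code. The Python sketch below evaluates equation (7.1) directly; it is an illustration only, the function name `normal_curve_frequency` is an assumption, and the values N = 86 and σ = 6.45 are those of the memory-test distribution taken up in the next section. Y here is the frequency per unit of the score scale, not yet the frequency in a class interval.

```python
import math


def normal_curve_frequency(x, n, sigma):
    """Y of equation (7.1): frequency at a deviation x from the mean."""
    coefficient = n / (sigma * math.sqrt(2 * math.pi))
    return coefficient * math.exp(-x ** 2 / (2 * sigma ** 2))


if __name__ == "__main__":
    # Memory-test values used later in the chapter: N = 86, sigma = 6.45.
    for x in (-6.45, 0.0, 6.45):
        y = normal_curve_frequency(x, 86, 6.45)
        print(f"x = {x:+.2f}   Y = {y:.2f}")
```

The curve is highest at the mean (x = 0) and symmetrical about it, so that deviations of equal size above and below the mean give the same Y.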

Determining the Best-fitting Normal Distribution for a Set of Data.— 
For the sake of an illustration that will help us to appreciate the meaning 
of the normal curve, let us find the expected frequencies in a particular 
instance, a distribution of 86 scores in a memory test. The best-fitting 
normal curve for any set of data has the same mean and standard devi- 
ation as those computed from the actual data. The distribution of 
obtained frequencies of memory-test scores is given in column (7) of 
Table 7.1. The mean of this distribution is 26.1, and the standard devi- 
ation is 6.45. Our task is to find the frequencies to be expected in the 
same class intervals for a normal distribution with a mean of 26.1, a stand- 
ard deviation of 6.45, and an N of 86. 

Standard Measurements or Scores.—In order to use equation (7.1) to
find these frequencies, we must know how far each class interval deviates
from the mean in terms of standard deviations. Each interval is given
the value of its midpoint as its point on the score scale X. These X
values are listed in column (2) of Table 7.1. Note that we have included
one class interval beyond the range of obtained scores at each end of the
distribution. This is because the best-fitting normal curve usually has
some small frequencies (perhaps fractional) in those extreme positions,
even though the obtained frequencies there are zero. The equation for
the normal curve calls for deviations rather than original scores, in other
words, for X - M, or small x, for each class interval. These are listed
in column (3). In this problem, each one is found by the solution of
X - 26.1 for every interval. A simple check is to see that each one is
three units (the size of the interval) distant from its immediate neighbors.


TABLE 7.1.—OBTAINING THE EXPECTED FREQUENCIES f_e IN THE CLASS INTERVALS
FOR THE MEMORY TEST, ON THE ASSUMPTION THAT THE TRUE DISTRIBUTION
IS NORMAL

   (2)    |    (3)    |   (4)    |   (5)   |    (6)    |    (7)
    X     |     x     |    z     |    y    |    f_e    |    f_o
 Midpoint | Deviation | Standard |  From   | Expected  | Observed
          |           |  score   | Table B | frequency | frequency
    45    |   +18.9   |   2.93   |  .0055  |    0.2    |     0
    42    |   +15.9   |   2.47   |  .0189  |    0.8    |     1
    39    |   +12.9   |   2.00   |  .0540  |    2.2    |     4
    36    |   + 9.9   |   1.53   |  .1238  |    5.0    |     5
    33    |   + 6.9   |   1.07   |  .2251  |    9.0    |     8
    30    |   + 3.9   |   0.60   |  .3332  |   13.3    |    14
    27    |   + 0.9   |   0.14   |  .3951  |   15.8    |    17
    24    |   - 2.1   |  -0.33   |  .3778  |   15.1    |     9
    21    |   - 5.1   |  -0.79   |  .2920  |   11.7    |    13
    18    |   - 8.1   |  -1.26   |  .1804  |    7.2    |     8
    15    |   -11.1   |  -1.72   |  .0909  |    3.6    |     3
    12    |   -14.1   |  -2.19   |  .0363  |    1.5    |     4
     9    |   -17.1   |  -2.65   |  .0119  |    0.5    |     0
          |           |          |         |   85.9    |    86.0

Each column of numbers is derived from the one preceding by the following
computations (see text for explanations):
Column (3): x = X - 26.1.
Column (4): z = x/6.45.
Column (5): y comes from Table B.
Column (6): f_e = 40 × y.


The next step involves a new process: the determination of the standard
measurement, or standard score, for every interval. The standard score is
given by the formula

$$z = \frac{x}{\sigma} = \frac{X - M}{\sigma} \quad \text{(A standard score or measure)} \tag{7.2}$$

In the equation for the normal curve, it will be seen that the exponent
of e, which is $-x^{2}/2\sigma^{2}$, can be written $-\tfrac{1}{2}(x/\sigma)^{2}$,
or in other words, it is $-\tfrac{1}{2}$ times the standard score squared.
We shall find the standard
score invaluable again and again. The statistical tables are constructed
on the basis of standard scores. It matters not, then, what our original
means and standard deviations are numerically. Reducing all raw scores
to standard scores places them all on the same basis or common
denominator. For our illustrative problem, the standard scores are given in
column (4) of Table 7.1. Each number in column (4) is obtained by
dividing the corresponding number in column (3) by 6.45, the standard
deviation.
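
Columns (3) and (4) of Table 7.1 can be reproduced with the short Python sketch below. It is only an illustration of formula (7.2); the function name `standard_score` and the constant names are assumptions, while the mean (26.1), the standard deviation (6.45), and the interval midpoints are those of the memory-test problem.

```python
MEAN = 26.1
STANDARD_DEVIATION = 6.45
MIDPOINTS = [45, 42, 39, 36, 33, 30, 27, 24, 21, 18, 15, 12, 9]


def standard_score(raw_score, mean=MEAN, sd=STANDARD_DEVIATION):
    """z = (X - M) / sigma, formula (7.2)."""
    return (raw_score - mean) / sd


if __name__ == "__main__":
    for X in MIDPOINTS:
        x = X - MEAN                     # deviation from the mean, column (3)
        z = standard_score(X)            # standard score, column (4)
        print(f"{X:>3}  {x:+6.1f}  {z:+.2f}")
```

Because the result is expressed in standard-deviation units, the same function serves for any test once its own mean and standard deviation are supplied.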

Determining Frequencies for the Class Intervals.—Having obtained the
standard score for each class interval, we are now ready to look up the
corresponding ordinate in the general statistical table, Table B. These
are listed in column (5) of the work table. The ordinates in this table
are not exactly the frequencies we have been wanting to find. Those
frequencies also depend upon N [see equation (7.1)]. Table B is
constructed on the assumption that N = 1 and σ = 1. For our distribution
of 86 cases and a different σ, we must make a certain adjustment. We
must multiply each y value by a certain number to find the expected
frequency fe. The general formula is

$$f_e = \left(\frac{iN}{\sigma}\right) y$$    (Expected frequency in a best-fitting normal distribution)    (7.3)


In this problem,

$$\frac{iN}{\sigma} = \frac{3 \times 86}{6.45} = \frac{258}{6.45} = 40$$

When this multiplier is used with the numbers in column (5), the
frequencies we desired are finally forthcoming, and they are given in
column (6).

Formula (7.3) may be made to appear reasonable if we look at it in
the following manner. The expected frequencies (fe) must be of the order
of magnitude of the obtained frequencies (fo). The sum of the obtained
frequencies is, of course, equal to N. The expected frequencies are,
therefore, proportional to N, as formula (7.3) states. They must also
be proportional to the size of class interval (i) because the larger the
size of interval, the smaller the number of them, and, since they add up
to N, the larger each frequency is. The appearance of σ in the
denominator is not quite so easily explained. It is best explained when
we consider the equation for the normal curve. Ignoring the expression
involving e (with its exponent) in equation (7.1), we find that Y is
proportional to $N/\sigma\sqrt{2\pi}$. When we let both N and σ equal 1,
as is the case in the tables on the normal curve, y is proportional to
$1/\sqrt{2\pi}$. From this we see that the ratio of Y to y is equal to
N/σ. Thus, from another approach we can account for the presence of σ in
formula (7.3) as well as the presence of N.
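
The steps that lead from midpoints to expected frequencies can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions: the function name `expected_frequencies` is hypothetical, and the ordinate y is computed from the equation of the unit normal curve (through the standard library's `NormalDist`) instead of being read from Table B, so a result may differ from the work table in the last decimal place.

```python
from statistics import NormalDist

# Values taken from the memory-test example in the text.
MEAN = 26.1      # M
SIGMA = 6.45     # standard deviation
N = 86           # number of cases
INTERVAL = 3     # size of class interval, i

MIDPOINTS = [45, 42, 39, 36, 33, 30, 27, 24, 21, 18, 15, 12, 9]


def expected_frequencies(midpoints, mean, sigma, n, interval):
    """Return rows (X, x, z, y, fe), i.e. columns (2) to (6) of the work table."""
    unit_normal = NormalDist()           # N = 1 and sigma = 1, as in Table B
    multiplier = interval * n / sigma    # iN/sigma of formula (7.3)
    rows = []
    for X in midpoints:
        x = X - mean                     # deviation from the mean
        z = x / sigma                    # standard score, formula (7.2)
        y = unit_normal.pdf(z)           # ordinate of the unit normal curve
        rows.append((X, x, z, y, multiplier * y))
    return rows


rows = expected_frequencies(MIDPOINTS, MEAN, SIGMA, N, INTERVAL)
for X, x, z, y, fe in rows:
    print(f"{X:3d} {x:+6.1f} {z:+5.2f} {y:.4f} {fe:5.1f}")
print("sum of expected frequencies:", round(sum(r[4] for r in rows), 1))
```

Because z is carried here without rounding to two decimal places, the sum of the expected frequencies may differ slightly from the total obtained by adding rounded table entries.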

Comparing Obtained and Theoretical Frequencies.—As a rough check
upon all the work, we sum the expected frequencies, and the result should
be very close to N but will usually be slightly less than N, because in
the normal curve there are still fractions of frequencies even beyond the
limits we have included here. Had we not gone one class interval beyond 
the obtained data, we should have lost .2 of a frequency at the upper 
end and .5 at the lower, and the sum would have been 85.2 instead of 85.9. 
As it is, we have still lacking only .1 of a case; not enough to worry about, 
and we may accept our check as one indication of correct work. A com- 
parison of expected with obtained frequencies is always a rough check 
but is very rough, because we expect small discrepancies within class 
intervals. Looking down the columns, we find only one or two serious 
discrepancies. One is the difference between 15.1 and 9, and the other is 
between 1.5 and 4. Both the obtained frequencies of 9 and 4 are out of 
line but are probably merely chance discrepancies, coming under the 
heading “errors of sampling,” and are no more serious than may be 
expected in a coin-tossing experiment. 

Plotting the Best-fitting Normal Curve.—We could now use the expected
frequencies as the basis of plotting the best-fitting, smooth, normal
distribution curve for the memory-test data. If plotting such a curve is
our only objective, however, we have done some unnecessary work. A shorter
procedure for locating enough points for drawing the smooth best-fitting
curve will now be explained. It follows precisely the same principles laid
down in the previous discussion. But instead of being tied down to class
intervals and their midpoints for our x values, we arbitrarily choose
standard scores at convenient values .5σ apart, as in the first column of
Table 7.2. Since they are simple numbers, no interpolation will be
necessary in using Table B. Since the positive standard scores duplicate
the negative ones, half the work of looking up y values is obviated,
unless one wishes to repeat the process as a check. The expected
frequencies are again found by multiplying y by iN/σ, in this case, by 40.
As before, this step is for the sake of obtaining frequencies in the
proportions comparable with those obtained for a particular N (86), a
particular σ (6.45), and a particular size of class interval (3).

TABLE 7.2.—OBTAINING THE BEST-FITTING NORMAL CURVE FOR THE DATA ON THE
MEMORY TEST FOR THE PURPOSE OF PLOTTING THE CURVE

      (1)              (2)            (3)          (4)          (5)
       z                y             fe            x            X
 Standard score   From Table B     Expected     Deviation    Raw score
                                   frequency
     +3.0            .0044            0.2         +19.4         45.5
     +2.5            .0175            0.7         +16.1         42.2
     +2.0            .0540            2.2         +12.9         39.0
     +1.5            .1295            5.2         + 9.7         35.8
     +1.0            .2420            9.7         + 6.4         32.5
     +0.5            .3521           14.1         + 3.2         29.3
      0.0            .3989           16.0           0.0         26.1
     -0.5            .3521           14.1         - 3.2         22.9
     -1.0            .2420            9.7         - 6.4         19.7
     -1.5            .1295            5.2         - 9.7         16.4
     -2.0            .0540            2.2         -12.9         13.2
     -2.5            .0175            0.7         -16.1         10.0
     -3.0            .0044            0.2         -19.4          6.7

The numbers in the columns are obtained as follows:
Column (1): Arbitrarily chosen.
Column (3): 40 × y.
Column (4): 6.45 × z.
Column (5): x + 26.1.

1 The customary way of determining whether the discrepancies between
theoretical and obtained frequencies are so large as not to be attributed
to sampling errors is to employ the chi-square test (see Ch. 11). The
chi-square test, as applied to the normal-curve hypothesis, enables us to
arrive at a decision as to the probability that an obtained set of
frequencies is not normally distributed.

The frequencies found in this manner will not correspond to midpoints
of class intervals, however, but to other score-point positions on the
scale. These points will be .5σ apart, starting at the mean and going both
ways. They correspond to the z scores given in the first column of
Table 7.2. We need to find the corresponding X values for these z values.
The first step is to find the corresponding x deviations by the formula

$$x = z\sigma$$    (A deviation derived from a standard score)    (7.4)

These are shown in column (4) of Table 7.2. The X points corresponding
to x deviations can be found by the formula

$$X = M + x$$    (A measurement estimated from a deviation)    (7.5)


which, in this problem, is X = 26.1 + x. The X values we want are
shown in the last column of Table 7.2.

Having these score points and their corresponding frequencies, we can
construct the graph shown in Fig. 7.3. The observed frequencies (fo) are
also plotted as circlets to show where they fall with respect to the
best-fitting normal curve. The reasonableness of the fit is rather obvious.
It would probably have been not so easy to duplicate this normal curve
by the smoothing process recommended in Ch. 3. We may say by way
of general conclusion that if our obtained mean and standard deviation
approximate closely the mean and sigma of the population from which
our sample came, and if the distribution for the population is normal,
it looks like the curve in Fig. 7.3.
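
The shorter plotting procedure lends itself to a brief Python sketch. The function name `curve_points` and its `step` and `limit` parameters are hypothetical, chosen only for illustration, and the ordinates are computed from the unit normal curve instead of being looked up in Table B. Raw scores are obtained here directly from unrounded deviations, so an entry may differ from the work table by one unit in the last decimal place.

```python
from statistics import NormalDist

# Constants of the memory-test distribution, as given in the text.
MEAN, SIGMA, N, INTERVAL = 26.1, 6.45, 86, 3


def curve_points(mean, sigma, n, interval, step=0.5, limit=3.0):
    """Return (z, fe, X) for standard scores from -limit to +limit."""
    unit_normal = NormalDist()
    multiplier = interval * n / sigma          # iN/sigma, here 40
    steps = round(limit / step)                # number of steps on each side
    points = []
    for k in range(-steps, steps + 1):
        z = k * step                           # arbitrarily chosen standard score
        fe = multiplier * unit_normal.pdf(z)   # expected frequency, formula (7.3)
        X = mean + z * sigma                   # formulas (7.4) and (7.5) combined
        points.append((z, fe, X))
    return points


for z, fe, X in curve_points(MEAN, SIGMA, N, INTERVAL):
    print(f"{z:+4.1f} {fe:5.1f} {X:5.1f}")
```

The pairs (X, fe) are the points through which the smooth curve would be drawn.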


Fig. 7.3.—The best-fitting normal distribution curve for the memory-test
data. Obtained frequencies are represented by circlets. The normal curve
is "best-fitting" in the sense that it has the same mean and standard
deviation as the obtained distribution.


AREAS UNDER THE NORMAL CURVE 


Perhaps the greatest usefulness of the normal curve lies in the relation- 
ship of the amount of area under the curve lying between certain limits 
on the base line. In terms of mental-test scores, for example, this simply 
means the number or percentage of the cases to be expected between two 
score points. This is because the area under the curve represents the 
number or percentage of cases. The total area is equal to N, the total
number of cases. But if we think in terms of a standard curve where 
N = 100, we can readily deal with percentages. For example, 50 per 
cent of the surface lies above the mean and 50 per cent below. We can 
also think in terms of a standard curve whose total surface is equal to 1, 
or unity. In this instance we deal with proportions. The proportion of 
the area, or cases, lying above the mean is .5 and the proportion below is .5. 
The statistical tables are given in terms of a total area of 1, and the areas 
of certain segments are listed as proportions, but it is just as easy to talk 
in terms of percentages. A percentage is a proportion multiplied by 100, 
and a proportion is a percentage divided by 100. Thus .46 of the surface 
is 46 per cent; and 72 per cent of the cases is .72 of the surface, etc. 

Proportion of the Area between the Mean and Some Measurement or
Score.—We have already had occasion to say that the interval extending
one standard deviation on either side of the mean includes about
two-thirds of the cases. To say the same thing in another way, from the
mean to plus 1σ are to be expected about one-third of the cases, and from
the mean to minus 1σ, another one-third of the cases. We can verify this
by referring to Table B and looking up the proportion of the area between
the mean and 1σ (i.e., a z equal to 1.00). The area given to four decimal
places is .3413, or three thousand four hundred thirteen ten-thousandths
of the area. If there were a normal distribution with 10,000 cases, 3,413
of them would be expected between the mean and 1σ. In terms of
percentage, it would be 34.13 per cent, or 34.13 cases in 100. The total
interval from +1σ to -1σ contains twice this area, or .6826, or 68.26 per
cent. Figure 7.4 illustrates these facts graphically. We now see that
this is a little more than two-thirds (which would be 66.67 per cent), but
with small deviations from normality occurring on every hand we can
afford to be so rough with our expectations as to give it as two-thirds.


Fig. 7.4.—Different percentages of area under the normal curve within
the various 1-sigma units on the base line.


From Table B, we can also see that between the mean and a point 2σ
distant (either above or below, i.e., either +2σ or -2σ), we should expect
.4772 of the total surface, or 47.72 per cent of the cases. Included in the
range from -2σ to +2σ, we should find twice this proportion, or .9544 of
the area, or 95.44 per cent of the cases. Out to 3σ from the mean extends
.4987 of the area, and in both directions from the mean to 3σ we find
twice this, or .9974 of the area. Only 26 cases in 10,000 (10,000 - 9,974),
therefore, should be expected beyond the range from -3σ to +3σ in a
large sample.
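
These table look-ups may be checked with a short Python sketch. It assumes that the cumulative distribution function of the standard library's `NormalDist` may stand in for Table B; the helper name `area_mean_to_z` is hypothetical. Since the areas are carried here without rounding, doubling them can differ from the figures in the text by one unit in the fourth decimal place (the unrounded proportion beyond the range from -3σ to +3σ is about .0027).

```python
from statistics import NormalDist

UNIT_NORMAL = NormalDist()   # mean 0, standard deviation 1


def area_mean_to_z(z):
    """Proportion of the total area between the mean and standard score z."""
    return abs(UNIT_NORMAL.cdf(z) - 0.5)


for z in (1.0, 2.0, 3.0):
    one_side = area_mean_to_z(z)
    print(f"mean to {z:.0f} sigma: {one_side:.4f}   "
          f"-{z:.0f} sigma to +{z:.0f} sigma: {2 * one_side:.4f}")
```

The sign of z does not affect the result, because the curve is symmetrical about the mean.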

To take another example of a less special nature, how much of the area
under the normal curve will be found between the mean and +0.78σ?
From the table, we find this to be .2823. In still another problem, what
proportion of the cases lies between the mean and -1.47σ? From the table,
we find this to be .4292. Figure 7.5 illustrates these two cases. It will
be seen that the positive or negative sign of z merely tells us whether
the area extends above the mean or below. The numerical size of z, whether
positive or negative, determines the amount of area between the mean
and the point.


So far we have begun each problem of this type with some particular z
or standard measurement. Let us start the problem a step or two further
back and begin with some raw score or measurement. In the more
practical case, we begin with X, not z. In the memory-test data, we may
inquire what proportion of the cases come between the mean (26.1) and a
point of 35 on the scale of measurement. This point deviates 8.9 units
from the mean (X - M = +8.9). This is the deviation x. The standard
score z is x/σ, which equals 8.9/6.45 = +1.38. Everything must be
transformed into standard measure before the probability table may be
utilized. Entering the table with a z of 1.38, we find the corresponding
area to be .4162. In other words, 41.62 per cent of the cases in a normal
distribution would be found between the mean and 35 points on the scale.
In the memory-test data, 41.62 per cent of 86 is 35.8, or, in whole
numbers, 36 cases. In a similar manner, which the student should verify,
between the mean and a score of 20 are .3276 of the cases, or
approximately 28. Between the mean and 15 are about 39 cases of the 86,
and if we go on down to a score point of 5, we find 49.95 per cent of the
cases.

Fig. 7.5.—Proportions of the total area under the normal curve within
certain standard-score limits on the base line.

Special interest attaches to the question of the proportion of cases
between the mean and a score of 30.45. It will be found that the standard
score corresponding to this is 0.6745. From the table we find that
the proportion of the area to this point is .25, or exactly one-fourth.
This case is illustrated in Fig. 7.6. In short, the point at 0.6745σ
corresponds to a distance of 1Q from the mean.

Fig. 7.6.—Proportions of the cases to be expected between certain score
limits in the memory-test data, on the assumption that the distribution
is normal.
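
The passage from a raw score to an expected number of cases can be sketched as follows. The helper name `area_mean_to_score` is hypothetical, and the area is computed from the unit normal curve instead of being read from the table. Because z is not rounded to two decimal places before the area is found, a result may differ from the text in the fourth decimal place.

```python
from statistics import NormalDist

# Constants of the memory-test distribution, as given in the text.
MEAN, SIGMA, N = 26.1, 6.45, 86
UNIT_NORMAL = NormalDist()


def area_mean_to_score(score, mean=MEAN, sigma=SIGMA):
    """Proportion of cases expected between the mean and a raw score."""
    z = (score - mean) / sigma          # transform to standard measure first
    return abs(UNIT_NORMAL.cdf(z) - 0.5)


for score in (35, 20, 15, 5, 30.45):
    proportion = area_mean_to_score(score)
    print(f"score {score}: proportion {proportion:.4f}, "
          f"cases {proportion * N:.1f}")
```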

The Area above or below a Certain Point on the Scale.—For a given
deviate or standard score, Table B also gives us the proportion of the
areas above a certain point on the scale or below it. Above a point at
+1σ will be found .1587 of the area. This is found in column (C) of
Table B, because when a vertical line is erected at +1σ (see Fig. 7.7)
it divides the total area under the curve into two portions, the one above
the line being the smaller of the two. Below the point +1σ is the
remainder of the area, or the larger portion [found in column (B) of the
table], including .8413, or 84.13 per cent of the area. If we were
interested in the point -1σ, the larger portion under the curve is now
above the point of division and is found in column (B), whereas the
portion below, being the smaller of the two, is found in column (C). The
situation is just reversed to the case where the division comes at +1σ.
It is necessary to keep in mind in this kind of problem whether the area
we wish to know is under the smaller end of the curve, all on one side of
the mean, or whether it is under the larger side of the curve extending
across the mean.

Fig. 7.7.—Proportions of the area above and below the standard score of
+1σ and under the normal curve.

The proportion of the area above the point at +0.78σ is in the smaller
portion and, found in column (C), it is .2177. The area below -1.47σ is
also under the smaller portion of the curve, and from column (C), we find
that it is .0708 (see Fig. 7.5). The area above the point -1.47σ would be
equal to 1.0 - .0708, which is .9292. Or it can be found from column (B),
since it occupies the larger portion under the curve, and this also gives
us .9292. Or, from Fig. 7.5, we can see that it is the sum of the area
from the point to the mean (.4292) plus .500, which gives the same result.

In the memory-test data, where the mean is 26.1 and σ is 6.45, we may
ask for the percentage of the cases to be expected below a score of 15.
The deviation from the mean is -11.1. When this is divided by 6.45, we
find that the z score is -1.72. Corresponding to a z of -1.72 is an area
of .0427 in the tail of the normal curve (see Fig. 7.6). We may expect
4.27 per cent of the cases below a score of 15; or, out of 86, this would
be 3.7 cases. Above a score of 15, we should expect the remainder of the
cases, naturally; i.e., a proportion of .9573, a percentage of 95.73, and
in number of cases, 82.3. Above a score of 30.45, which corresponds to a
z score of +0.6745, we should expect 25 per cent of the cases.
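
A short Python sketch may make the two kinds of area concrete. The names `proportion_below` and `proportion_above` are hypothetical, and the best-fitting curve is represented by a `NormalDist` object with the mean and σ of the memory-test data, which is an assumption of this illustration and not a part of the tabular method.

```python
from statistics import NormalDist

# Constants of the memory-test distribution, as given in the text.
MEAN, SIGMA, N = 26.1, 6.45, 86
MEMORY_TEST = NormalDist(MEAN, SIGMA)   # the best-fitting normal curve


def proportion_below(score, dist=MEMORY_TEST):
    """Proportion of the area below a raw score."""
    return dist.cdf(score)


def proportion_above(score, dist=MEMORY_TEST):
    """Proportion of the area above a raw score."""
    return 1.0 - dist.cdf(score)


below_15 = proportion_below(15)
above_15 = proportion_above(15)
print(f"below 15: {below_15:.4f} of the area, {below_15 * N:.1f} cases")
print(f"above 15: {above_15:.4f} of the area, {above_15 * N:.1f} cases")
print(f"above 30.45: {proportion_above(30.45):.4f} of the area")
```

The two proportions for any one point necessarily sum to 1, which is the relation used in the text when the larger portion is found by subtraction.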

Area between Two Points on the Scale.—The first case of this kind of
problem has already been mentioned when we asked for the proportion
of the area between -1σ and +1σ and the like. When the two score
points are on two sides of the mean, it is simply a matter of summing
the two areas between the mean and the two points. For example,
between the points -1.47σ and +0.78σ, we have the two areas .4292 and
.2823 to add (see Fig. 7.5). The result is .7115, or 71.15 per cent.


Fig. 7.8.—Score points above or below which certain percentages of the
cases are expected in the memory-test distribution, assuming normality
of distribution.


When the two points lie on the same side of the mean, it is a matter of
subtracting the smaller area from the larger, more inclusive area. For
example, the area between points at +1σ and +2σ can be found by first
obtaining from the table the area from the mean to +1σ (which is .3413)
and the area from the mean to +2σ (which is .4772). The area we seek is
.4772 - .3413 = .1359 (see Fig. 7.4). The area between points -2σ and
-3σ would be the area .4987 [from Table B, column (A)] minus .4772
(from the same source). The difference is equal to .0215, which is
illustrated in Fig. 7.4.

The area between two raw-score points again involves the determination
of z scores as the first step. In the memory-test data, between
scores 10 and 20, which correspond to z scores of -2.50 and -0.945,
respectively, the area is the difference between .4938 and .3276, which is
.1662, or 16.62 per cent. The areas from the mean to the two z scores
are found as usual in Table B. As one more example from the same data,
. . . mean in the two cases .2274 and .4162. The student should verify
these estimates.
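
Both cases, points on opposite sides of the mean and points on the same side, reduce to a single difference of cumulative areas, as the following sketch suggests. The function names `area_between_z` and `area_between_scores` are hypothetical, and the cumulative distribution function of `NormalDist` is assumed to stand in for Table B. Results may differ from the text in the fourth decimal place, since the text subtracts or adds four-place table entries.

```python
from statistics import NormalDist

UNIT_NORMAL = NormalDist()


def area_between_z(z1, z2):
    """Proportion of the area between two standard scores, in either order."""
    low, high = sorted((z1, z2))
    return UNIT_NORMAL.cdf(high) - UNIT_NORMAL.cdf(low)


def area_between_scores(x1, x2, mean, sigma):
    """Proportion of the area between two raw scores."""
    return area_between_z((x1 - mean) / sigma, (x2 - mean) / sigma)


print(f"{area_between_z(-1.47, 0.78):.4f}")   # points on two sides of the mean
print(f"{area_between_z(1.0, 2.0):.4f}")      # both points above the mean
print(f"{area_between_z(-2.0, -3.0):.4f}")    # both points below the mean
print(f"{area_between_scores(10, 20, 26.1, 6.45):.4f}")
```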

Points above or below Which Certain Proportions of the Cases Fall.— 
The next problems reverse the processes that have just been described. 
Before, we were given points on the scale of measurement to determine 
areas; now we are given areas from which to determine points on the scale. 
For example, above what point in the normal curve does the highest 10 per 
cent of the cases come? Ten per cent is a proportion of .10. We could 
now use Table B in reverse, but it is much more convenient to utilize 
Table C, which gives the proportions in even steps. We are faced with a 
problem that gives the proportion in the tail of the curve; so we look in 
the last column for C, the smaller area. We find the z score correspond- 
ing to it to be 1.2816. This will be with plus sign, since we are talking 
about the highest 10 per cent (see Fig. 7.8). Had we asked below what 
point does the lowest 10 per cent fall, the answer would have been 
-1.2816σ. If the question is, "Above what score lies the highest 80 per
cent of the cases?" we are then dealing with the larger proportion under
the curve; so we look for the proportion of .80 in the first column of
Table C. The corresponding z score is -0.8416σ (see Fig. 7.8). Had
we asked for the point below which is the lowest 80 per cent, the answer
would have been +0.8416.

To apply these same questions to the memory-test data, we need go a 
step further and transform the z scores into terms of the raw-score scale. 
The highest 10 per cent come above a z of +1.2816. Multiplying this
by σ (which is 6.45), we obtain the deviation (x) of +8.27. The mean
(or 26.1) plus 8.27 gives us a score of 34.37 points. The highest 10 per
cent in a normal curve with mean of 26.1 and sigma of 6.45 would come
above the point 34.37. It happens that this point comes close to the
division point between two class intervals, or 34.5. In the actual
distribution (see Table 7.1), 10 cases, or close to 12 per cent, were
scores of 35 or above, which is good agreement. Ten per cent would have
called for 8.6 cases, or 9 in whole numbers.

The highest 80 per cent of the cases, which we found to come above a
z score of -0.8416σ, will be expected above a raw score of what? The
deviation of this point from the mean is -5.43 points, or a score of 20.67.
This comes close to another division point between class intervals, namely,
20.5. In the actual distribution, 71, or 82.5 per cent, of the cases are
above a score of 20.5. Again the agreement between obtained proportion
and expected proportion is quite close. To take one more case, which
gives a point exactly between class intervals, we ask above what
point are 93.2 per cent of the cases? The point turns out to be a score of
16.5 points (the student should verify this). The actual percentage of
cases above this score point is 92, again a very close agreement.
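
The reverse problem, from a proportion to a score point, can be sketched with the inverse of the cumulative normal function. Here `inv_cdf` of the standard library's `NormalDist` is assumed to take the place of Table C, and the names `z_above_which` and `score_above_which` are hypothetical.

```python
from statistics import NormalDist

# Constants of the memory-test distribution, as given in the text.
MEAN, SIGMA = 26.1, 6.45
UNIT_NORMAL = NormalDist()


def z_above_which(proportion):
    """Standard score above which the given proportion of the cases falls."""
    return UNIT_NORMAL.inv_cdf(1.0 - proportion)


def score_above_which(proportion, mean=MEAN, sigma=SIGMA):
    """Raw score above which the given proportion of the cases falls."""
    return mean + z_above_which(proportion) * sigma


for proportion in (0.10, 0.80, 0.932):
    print(f"highest {proportion:.1%}: z = {z_above_which(proportion):+.4f}, "
          f"score = {score_above_which(proportion):.2f}")
```

The third line of output bears on the verification left to the student for the 93.2 per cent case.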

Centiles and Corresponding z Scores.—By now it may be apparent
that we can look up in the tables the z score corresponding to any given
centile. For example, P90 is the point below which are 90 per cent of the
cases. Entering Table C with .90 in column (B), we find the corresponding
z to be +1.2816. Corresponding to P80 is the z score of +0.8416.
We could find the raw-score points corresponding to all these z scores
for any particular distribution. If the assumption of normal distribution
is valid, this procedure would be an advance step over the recommendation
of smoothed ogives for setting up centile norms. But if there is any
noticeable skewing in the distribution, this procedure would be rather
questionable. The smoothed-ogive method would take the actual skewness
into account. Since further measurements with the same test will probably
yield the same kind of distribution from the same population, this
deviation from normality should be represented in the norms.

It can now be explained how, earlier (see Table 6.5), we arrived at the 
spacing of centile scores on the profile chart (Fig. 6.5). The values given 
to represent the spacing of the centiles are the z scores corresponding to 
them, and they were obtained as was explained in the preceding paragraph.
The result is to normalize the distribution of all tests, whether
the original measuring scale gave a normal distribution or not. There is, 
in other words, a general underlying assumption of normal distribution of 
the population in all the abilities represented in the profile chart. The 
most important gain in so doing is to transform measurements of all 
abilities into the terms of a common intelligible scale. 

The Points between Which Lie Certain Proportions of the Middle
Cases.—Among the problems involving area under the curve, there
remains the case in which, given the area of a central group, what are
the score limits of that group? The only practical case here occurs when
the central group is evenly balanced on either side of the mean: the
middle 50 per cent, 80 per cent, or 90 per cent. Those groups, it will be
remembered, are significant in connection with indicators of variability
and are given distinction in the graphic device illustrated in Fig. 6.7.
Here, however, we are talking about the best-fitting normal curve and not
the original distribution. The middle 50 per cent extends from Q1 to Q3,
or from P25 to P75. Going to the tables with a proportion of .75, we find
the corresponding z to be, as we should expect, 0.6745σ. The two points
bounding this middle 50 per cent are -0.6745 and +0.6745. In the
distribution of memory-test scores, these points would correspond to
actual scores of 21.75 and 30.45. The interpolated Q1 and Q3 in this same
obtained distribution were 21.00 and 30.85, respectively, or not very far
from those estimated in the best-fitting curve. The middle 80 per cent
extends from P10 to P90. We have previously determined these to be at a
distance of 1.2816σ, minus and plus. The corresponding raw scores are
17.83 and 34.37. The interpolated 10th and 90th centiles are 17.1 and
35.3, again in close agreement. This kind of problem has really little
application in psychological and educational statistics, but is included
for the sake of completeness and with the hope that it may lend further
insight into the several ramifications of the normal distribution curve.
All other problems having to do with area illustrated above do have
numerous and valuable applications, some of which we shall meet in
Ch. 12.
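
The limits of a balanced central group follow from one look-up, as this Python sketch suggests. The function name `middle_limits` is hypothetical, and `inv_cdf` of `NormalDist` is assumed to take the place of the table of proportions.

```python
from statistics import NormalDist

# Constants of the memory-test distribution, as given in the text.
MEAN, SIGMA = 26.1, 6.45


def middle_limits(proportion, mean=MEAN, sigma=SIGMA):
    """Score limits enclosing the middle `proportion` of a normal curve."""
    # Half of the excluded area lies in each tail, so the upper limit is the
    # point below which 0.5 + proportion / 2 of the area falls.
    z = NormalDist().inv_cdf(0.5 + proportion / 2.0)
    return mean - z * sigma, mean + z * sigma


for proportion in (0.50, 0.80, 0.90):
    low, high = middle_limits(proportion)
    print(f"middle {proportion:.0%}: {low:.2f} to {high:.2f}")
```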
Exercises 


1. a. Toss six pennies 64 times. After each throw, note and record the number of 
heads. Compare your obtained frequencies with the expected frequencies. 
Plot frequency polygons of the two distributions. Compute the mean and 
standard deviation of the distribution. 

b. Toss the same six pennies 64 times more, obtaining a new set of data like the
first. Compute the mean and standard deviation of this distribution, and make 
comparisons with the first obtained distribution and with the theoretical 
distribution. 

c. Combine the two distributions into a single one. Are the frequencies now 
any nearer the expected ones? Compute the mean and standard deviation. 
Are they any nearer the mean and standard deviation of the theoretical 
distribution? 

d. One more experiment may be tried in which some of the outcomes with a 
small number of heads are not counted, but another throw is immediately 
substituted. Every second case in which at a glance you can tell the number 
of heads is small, should be ignored and the trial repeated. Again, obtain 64 
record trials. This situation illustrates a biased sampling. What is the effect 
upon the frequencies? 

e. What would happen in another set of trials if one penny were left head up, 
only the remaining five being thrown each time but all six coins being observed
and all heads being counted? 

2. Determine the standard scores for all the midpoints in the distribution of Data 7A.
Also determine the z scores for the following raw scores: 40, 55, 72, 85, 95.

3. From Table B, determine the ordinate value at each midpoint of distribution 7A.

4. Find the expected frequency for each class interval, and tabulate them and the
observed frequencies in parallel columns. State some inferences that you can draw
from your results.

5. Find the best-fitting normal curve for Data 7A after the manner of Table 7.2.
Plot the curve along with the obtained frequencies.


6. Find the proportions and percentages of the areas under the normal curve between
the mean and the following z scores: -2.15  -1.85  -0.19  +0.375  +1.1
+3.52.


DATA 7A.—DISTRIBUTION OF SPELLING-TEST SCORES IN A SUPERIOR GROUP OF
FRESHMEN*

 Scores      f
 82-85       1
 78-81       8
 74-77       8
 70-73       5
 66-69      34
 62-65      21
 58-61      39
 54-57      32
 50-53      20
 46-49       7
 42-45       3
 38-41       0
 34-37       1
 Sum       179
 Mean     61.1
 σ         8.4

* The test was one of the Cooperative series, and the scores are T scores (see Ch. 12).


7. Find the proportions and numbers of the cases to be expected between the mean
and the following scores in Data 7A: 35  45  60  75  79.5  38.35.

8. Find the proportions of the area above the following z scores: +2.15  +1.62
+0.175  -0.36  -1.9  -2.8. Also, below the following z scores: -3.85
-1.225  -0.6745  +0.005  +1.75  +2.3.

9. . . . 54.5  41.5. And below these scores: 85  45  56  35  77.5  41.5  61.5.
Whenever possible, compare expected with obtained frequencies.

10. Find the proportions of the area falling between z scores of -1.50 to +1.25;
-0.05 to +2.76; +0.55 to +0.95; -2.78 to -1.12; +3.15 to +2.95;
-0.72 to -1.05; +1.24 to -0.33.

11. Find the proportions and numbers of cases to be expected in Distribution 7A
between scores of 70 and 80; 35 and 45; 45 and 65; 65.5 and 77.5; 49.5 and
57.5; 45.5 and 65.5; 65.5 and 69.6; 61.5 and 65.6; 53.5 and 57.5. Whenever
possible, compare expected with obtained frequencies.

12. Give in terms of standard measurements the points above which the following
percentages of the cases fall in the normal distribution: 85, 55, 35, 42.3, 66.7,
and 9.42 per cent.

13. Give the z scores below which the following proportions of the cases will fall:
.14  .62  .375  .418  .729

14. Above what scores in Distribution 7A will the following percentages of the cases
be expected: 12, 54, 84.13, 5.75, and 68.4 per cent?

15. Below what scores in Distribution 7A should we expect the following number of 
cases: 11 63 89.5 123 162? Compare expected with actual cumula- 
tive frequencies. 

16. What z scores correspond to the following centile ranks: 75 62.5 16.7 5 
99? 

17. Between what score limits in Distribution 7A should we expect the middle 
80 per cent of the cases? The middle 50 per cent? The middle 90 per cent? Com- 
pare these with the interpolated limits for these same percentages. 


CHAPTER 8 
CORRELATION 


No single statistical procedure has opened up so many new avenues of
discovery in psychology and education as that of correlation. This is
understandable when we remember that scientific progress depends upon
finding out what things are co-related and what things are not. A
coefficient of correlation is a single number that tells us to what extent
two things are related; to what extent variations in the one go with
variations in the other. Without the knowledge of how one thing varies
with another, we should find predictions impossible. And wherever causal
relationships are involved, without knowledge of covariation, we should
be unable to control one thing by manipulating another.
For example, when we know that the higher a girl’s score in a clerical- 
aptitude test the higher the average performance she is likely to exhibit 
after training, we can thereafter use scores on this test to predict level of 
proficiency. We say that there is a high positive correlation between 
aptitude-test score and clerical success. We discover this fact by finding 
a coefficient of correlation between scores of a number of girls and meas- 
ures of clerical performance later for the very same girls. We can never 
compute a coefficient of correlation on one person alone, nor can we com- 
pute it without having made two sets of measurements on the same indi- 
viduals, or on matched pairs of individuals. In this instance, if we con- 
sider that the aptitude test has measured individual differences in some 
quality or qualities that lead to success, i.e., in the sense of a cause of
clerical success, then we can not only predict future success for individuals 
but also promote high general efficiency in any group of clerks by selecting 
those with high scores. Thus are studies leading to prediction and control
. . . methods, which is highly unlikely.

THE MEANING OF CORRELATION

Some Examples of Correlation between Two Variables.—The coefficient
of correlation is one of those summarizing numbers, like a mean
or a standard deviation, which, though it is a single number, tells a
story. It can vary from a value of +1.00, which means perfect positive
correlation, through zero, which means complete independence or no
correlation whatever, on down to -1.00, which means perfect negative
correlation.

Fig. 8.1.—A simple correlation chart showing the kind of relationship
between X and Y scores when the correlation is +1.00.

A Case of Perfect Positive Correlation.—Figure 8.1 illustrates an
instance of perfect positive correlation. It is a fictitious case, for
such exact agreement between two things is rarely or never experienced,
certainly not in psychology or education. Here we have assumed two tests,
X and Y. Ten individuals have received scores in the two tests. The
pairs of scores are as follows:


Individual............ A B c D E F 
Score in test X........ 2 4 5 6 7 8 
Score in test Y......5. 4 6 7 8 9 10 


Looking down the rows of scores, each pair made by one individual, we 
readily conclude that each person’s score in Y is two points higher than 
his score in X. In terms of a simple equation, Y = X + 2. There are 
no exceptions, which makes the correlation perfect. 
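
The claim that an exceptionless rule such as Y = X + 2 yields a coefficient of +1.00 can be checked numerically. The following is a minimal sketch, not part of the original text; it uses only the six pairs legible in the listing above, and the helper name `pearson_r` is an assumption introduced for illustration. The computing formula it applies is the deviation form presented later in this chapter.

```python
import math

def pearson_r(xs, ys):
    """Product-moment r computed from deviations about the two means."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sum_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sum_x2 = sum((x - mean_x) ** 2 for x in xs)
    sum_y2 = sum((y - mean_y) ** 2 for y in ys)
    return sum_xy / math.sqrt(sum_x2 * sum_y2)

# The six pairs legible in the listing (individuals A to F).
test_x = [2, 4, 5, 6, 7, 8]
test_y = [4, 6, 7, 8, 9, 10]

# Every pair obeys Y = X + 2, so there are no exceptions to the rule.
no_exceptions = all(y == x + 2 for x, y in zip(test_x, test_y))

r_perfect = pearson_r(test_x, test_y)
print(no_exceptions, round(r_perfect, 2))
```

Adding a constant to every score shifts the mean by the same constant, so the deviations, and therefore r, are unchanged.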

To take another instance: 


Individual........++++ 


Score in test P....----- 
Score in test Q....----- 


In this situation, each person’s score in Q is two times that in P, without
exception; again there is perfect agreement, and the coefficient of
correlation would be +1.00. The equation for predicting Q from P is
Q = 2P.

A Case of High Positive Correlation.—In Fig. 8.2, we have illustrated a
case of correlation that is positive but less than +1.00. The graphic


picture of the individuals shows that, in general, a person who is high in
test X is also high in test Y, and one who is low in X is also likely to be
low in Y. The actual scores for these 10 people are listed in the first
two columns of Table 8.1. It will be seen that although the individuals are
arranged in rank order for scores in X, there are some deviations from this
rank order when we inspect their scores in Y. The coefficient of
correlation by computation is equal to +.76. We shall soon see how this was
obtained but first simply note by comparison of Figs. 8.1 and 8.2 how the
individuals are scattered in the diagrams. In Fig. 8.1, they line up in
perfect file from lowest to highest. In Fig. 8.2, they tend to fan out or
to diverge from a strict line-up, but a definite trend of relationship can
be observed. The amount of spreading in Fig. 8.2 as compared with that in
Fig. 8.1 (in which it is, of course, none) illustrates the difference
between correlations of +1.00 and +.76.

Fig. 8.2.—A correlation chart illustrating the kind of situation when the
correlation is +.76.
A Case of Low Positive Correlation. A third instance is shown in Fig. 8.3,
in which the spreading effect to which our attention was called before is
even greater. The coefficient of correlation here is +.14; in other words,
close to zero. This being true, a person with a high score in X is likely
to be almost anywhere, within the total range, in terms of his Y score.
The three highest people in X, with scores of 10, 12, and 13, scatter all
the way from 3 to 11 in test Y. The three lowest people in test X, with
scores of 1, 3, and 4, scatter all the way from 2 to 9 in test Y. Although
there is a trace of relationship between X scores and Y scores, it is very
weak. The actual scores may be compared in Table 8.3.

Fig. 8.3.—An example of a correlation chart when the correlation is only
+.14. (Horizontal axis: Test X; vertical axis: Test Y.)

A Case of High Negative Correlation. The situation that obtains when there
is a negative correlation is shown in Fig. 8.4. Here the coefficient is
-.69. Compare this diagram with that in Fig. 8.2, and it will be apparent
that the trend of the points is along the other diagonal now, from upper
left to lower right. This illustrates the fact that persons making high
scores in X are likely to make low scores in Y, and persons making low
scores in X are likely to make high scores in Y. This inverse order of
relationship is also apparent in the actual scores in the first two columns
of Table 8.2. The numerical size of the coefficient (.69) is nearly the
same as for the correlation in Fig. 8.2 (.76). It will be seen that the
width of scatter of the points is about the same in the two cases. A
perfect negative correlation would be pictured as a line of dots like that
in Fig. 8.1 but it would slant downward instead of upward from left to
right. The algebraic sign of the coefficient of correlation therefore
merely has to do with the direction of the relationship between two things,
whether direct or inverse, and the size of the coefficient (distance from
zero) has to do with the strength or closeness of the relationship.

Fig. 8.4.—An example of a correlation chart when the correlation is -.69.


HOW TO COMPUTE A COEFFICIENT OF CORRELATION


The Product-moment Coefficient of Correlation.—The standard kind
of coefficient of correlation and the one most commonly computed is
Pearson’s product-moment coefficient. The basic formula is

$$r_{xy} = \frac{\Sigma xy}{N \sigma_x \sigma_y} \qquad (8.1)$$

(Basic formula for a Pearson product-moment coefficient of correlation)

where $r_{xy}$ = correlation between X and Y.
      $x$ = deviation of any X score from the mean in test X.
      $y$ = deviation of the corresponding Y score from the mean in
            test Y.
      $\Sigma xy$ = sum of all the products of deviations, each x deviation
            times its corresponding y deviation.
      $\sigma_x$ and $\sigma_y$ = standard deviations of the distributions
            of X and Y scores.


The steps necessary are illustrated in Table 8.1. They will be enumerated
here:

Step 1. List in parallel columns the paired X and Y scores, making sure
        that corresponding scores are together.
Step 2. Determine the two means $M_x$ and $M_y$. In Table 8.1, these are
        7.5 and 8.0, respectively.
Step 3. Determine for every pair of scores the two deviations x and y.
        Check them by finding algebraic sums, which should be zero.
Step 4. Square all the deviations, and list in two columns. This is for
        the purpose of computing $\sigma_x$ and $\sigma_y$.
Step 5. Sum the squares of the deviations to obtain $\Sigma x^2$ and
        $\Sigma y^2$.
Step 6. From these values compute $\sigma_x$ and $\sigma_y$.
Step 7. For every person, find his xy product (last column of Table 8.1).
        Sum these for $\Sigma xy$.
Step 8. We are now ready for formula (8.1). In the illustrative problem,
        the arithmetic is given following Table 8.1.


TABLE 8.1.—CORRELATION BETWEEN TWO SETS OF MEASUREMENTS OF THE SAME
INDIVIDUALS; UNGROUPED DATA; PRODUCT-MOMENT COEFFICIENT OF CORRELATION

          X      Y      x      y      x²      y²      xy
         13     11   +5.5     +3    30.25      9    +16.5
         12     14   +4.5     +6    20.25     36    +27.0
         10     11   +2.5     +3     6.25      9    + 7.5
         10      7   +2.5     -1     6.25      1    - 2.5
          8      9   +0.5     +1     0.25      1    + 0.5
          6     11   -1.5     +3     2.25      9    - 4.5
          6      3   -1.5     -5     2.25     25    + 7.5
          5      7   -2.5     -1     6.25      1    + 2.5
          3      6   -4.5     -2    20.25      4    + 9.0
          2      1   -5.5     -7    30.25     49    +38.5

Sums     75     80    0.0      0   124.50    144    102.0
Means   7.5    8.0                  Σx²      Σy²     Σxy

$$\sigma_x = \sqrt{\frac{124.50}{10}} = \sqrt{12.45} = 3.528$$

$$\sigma_y = \sqrt{\frac{144}{10}} = \sqrt{14.4} = 3.795$$

$$r_{xy} = \frac{\Sigma xy}{N \sigma_x \sigma_y} = \frac{102.0}{(10)(3.53)(3.79)} = \frac{102.0}{133.90} = +.76$$

An alternative solution without computing the σ’s:

$$r_{xy} = \frac{\Sigma xy}{\sqrt{(\Sigma x^2)(\Sigma y^2)}} = \frac{102.0}{\sqrt{(124.5)(144)}} = \frac{102.0}{\sqrt{17{,}928.0}} = \frac{102.0}{133.90} = +.76$$


A Shorter Solution.—There is an alternative and shorter route that
omits the computation of $\sigma_x$ and $\sigma_y$, should they not be
needed for any other purpose. The formula is

$$r_{xy} = \frac{\Sigma xy}{\sqrt{(\Sigma x^2)(\Sigma y^2)}} \qquad (8.2)$$

(Alternative formula for a Pearson r)


The solution with this formula is also given with Table 8.1, and it leads
to the same coefficient. In both cases, two significant digits have been
saved in r, for the reason that for so small a number of cases the sampling
error in r is so relatively large that more than two digits would be rather
deceiving as to accuracy. When N is large (200 or more) three-place
accuracy in r may more properly be reported.
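
The eight steps and the two formulas can be traced in a few lines of code. What follows is a minimal sketch rather than a prescribed procedure; the variable names are assumptions chosen for illustration, and the ninth Y score (6) is inferred from its printed deviation (-2) and the mean of 8.0, since the raw value is not legible in Table 8.1.

```python
import math

# Paired scores from Table 8.1 (Step 1). The ninth Y score is inferred
# from its deviation of -2 about the mean of 8.0.
X = [13, 12, 10, 10, 8, 6, 6, 5, 3, 2]
Y = [11, 14, 11, 7, 9, 11, 3, 7, 6, 1]
N = len(X)

mean_x = sum(X) / N                      # Step 2: the two means
mean_y = sum(Y) / N
x = [v - mean_x for v in X]              # Step 3: deviations
y = [v - mean_y for v in Y]
sum_x2 = sum(d * d for d in x)           # Steps 4 and 5: sums of squares
sum_y2 = sum(d * d for d in y)
sigma_x = math.sqrt(sum_x2 / N)          # Step 6: standard deviations
sigma_y = math.sqrt(sum_y2 / N)
sum_xy = sum(a * b for a, b in zip(x, y))  # Step 7: sum of products

r_basic = sum_xy / (N * sigma_x * sigma_y)     # Step 8: formula (8.1)
r_short = sum_xy / math.sqrt(sum_x2 * sum_y2)  # formula (8.2)

print(sum_x2, sum_y2, sum_xy)
print(round(r_basic, 2), round(r_short, 2))
```

The two results agree because N times the product of the standard deviations equals the square root of the product of the two sums of squares.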

Computing a Negative Coefficient—As another example of the compu- 
tation of r, when the correlation is negative, Table 8.2 is presented. The 
operations are just the same, step by step. The only thing new is the 
care that must be taken with algebraic signs. 


TABLE 8.2.—A NEGATIVE CORRELATION IN UNGROUPED DATA BY THE PRODUCT-MOMENT
METHOD

          X      Y      x      y      x²      y²      xy
         12      7     +5   -1.5     25     2.25    - 7.5
         10      3     +3   -5.5      9    30.25    -16.5
          9      8     +2   -0.5      4      .25    - 1.0
          8      5     +1   -3.5      1    12.25    - 3.5
          7      7      0   -1.5      0     2.25      0.0
          7     12      0   +3.5      0    12.25      0.0
          6     10     -1   +1.5      1     2.25    - 1.5
          5      9     -2   +0.5      4      .25    - 1.0
          4     13     -3   +4.5      9    20.25    -13.5
          2     11     -5   +2.5     25     6.25    -12.5

Sums     70     85      0    0.0     78    88.50    -57.0
Means   7.0    8.5                  Σx²     Σy²      Σxy

$$\sigma_x = \sqrt{\frac{78}{10}} = \sqrt{7.8} = 2.79$$

$$\sigma_y = \sqrt{\frac{88.50}{10}} = \sqrt{8.85} = 2.97$$

$$r_{xy} = \frac{-57.0}{(10)(2.79)(2.97)} = \frac{-57.0}{82.863} = -.69$$

Computing r from Original Measurements —In both examples thus far, 
we have been dealing with a small number of observations and ungrouped 
data. When the data are more numerous, we resort to grouping into 
class intervals; but first let us see another procedure with ungrouped 
data, which does not require the use of deviations. It deals entirely 
with original scores. When raw scores are small numbers or when a good 
calculating machine is available, this is the best procedure. The formula 
may look forbidding but is really easy to apply:

$$r_{xy} = \frac{N\Sigma XY - (\Sigma X)(\Sigma Y)}{\sqrt{[N\Sigma X^2 - (\Sigma X)^2][N\Sigma Y^2 - (\Sigma Y)^2]}} \qquad (8.3)$$

(A Pearson r computed from original data)



where X and Y are original scores in variables X and Y. Other symbols
tell what is done with them. We follow the steps that are illustrated in
Table 8.3.

Step 1. Square all X and Y measurements.
Step 2. Find the XY product for every pair of scores.
Step 3. Sum the X’s, the Y’s, the X²’s, the Y²’s, and the XY’s.
Step 4. Apply formula (8.3).


TABLE 8.3.—CORRELATION OF UNGROUPED DATA COMPUTED FROM THE ORIGINAL
MEASUREMENTS

          X      Y      X²      Y²      XY
         13      7     169      49      91
         12     11     144     121     132
         10      3     100       9      30
          8      7      64      49      56
          7      2      49       4      14
          6     12      36     144      72
          6      6      36      36      36
          4      2      16       4       8
          3      9       9      81      27
          1      6       1      36       6

Sums     70     65     624     533     472
         ΣX     ΣY     ΣX²     ΣY²     ΣXY

$$r^2_{xy} = \frac{[N\Sigma XY - (\Sigma X)(\Sigma Y)]^2}{[N\Sigma X^2 - (\Sigma X)^2][N\Sigma Y^2 - (\Sigma Y)^2]} = \frac{(4{,}720 - 4{,}550)^2}{(6{,}240 - 4{,}900)(5{,}330 - 4{,}225)}$$

$$= \frac{(170)^2}{(1{,}340)(1{,}105)} = \frac{28{,}900}{1{,}480{,}700} = .019518$$

$$r_{xy} = \sqrt{.019518} = +.14$$


The author has found it more convenient, particularly when machine work
can be done, to compute r² first by the formula

$$r^2_{xy} = \frac{[N\Sigma XY - (\Sigma X)(\Sigma Y)]^2}{[N\Sigma X^2 - (\Sigma X)^2][N\Sigma Y^2 - (\Sigma Y)^2]} \qquad (8.4)$$

and then finally extract the square root to find $r_{xy}$, as shown just
below Table 8.3.

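
The raw-score procedure lends itself to machine work of any kind. The sketch below is offered as an illustration only, applying formulas (8.3) and (8.4) to the scores of Table 8.3; the variable names are assumptions. Because squaring discards the algebraic sign, the sign of r is taken here from the numerator, a detail the text leaves implicit.

```python
import math

# Original scores from Table 8.3.
X = [13, 12, 10, 8, 7, 6, 6, 4, 3, 1]
Y = [7, 11, 3, 7, 2, 12, 6, 2, 9, 6]
N = len(X)

# Steps 1 to 3: squares, cross products, and their sums.
sum_x = sum(X)
sum_y = sum(Y)
sum_x2 = sum(v * v for v in X)
sum_y2 = sum(v * v for v in Y)
sum_xy = sum(a * b for a, b in zip(X, Y))

# Step 4: numerator and denominator shared by formulas (8.3) and (8.4).
numerator = N * sum_xy - sum_x * sum_y
denominator = (N * sum_x2 - sum_x ** 2) * (N * sum_y2 - sum_y ** 2)

r_direct = numerator / math.sqrt(denominator)       # formula (8.3)
r_squared = numerator ** 2 / denominator            # formula (8.4)
# The square root is always positive, so restore the sign of the numerator.
r_from_square = math.copysign(math.sqrt(r_squared), numerator)

print(numerator, denominator)
print(round(r_squared, 6), round(r_from_square, 2))
```

Working in whole numbers until the final division avoids the rounding of deviations that the earlier method requires.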


Preparing a Scatter Diagram.—When N is large, even when N is
moderate in size, and when no calculating machine is available, the
customary procedure is to group data in both X and Y and to form a
scatter diagram or correlation diagram. The choice of size of class
interval and limits of intervals follows much the same rules as were 
given in Ch. 3. For the sake of a clearer illustration of the procedure, 
a smaller number of classes will be employed in the problem now to be 
described. The data were scores earned by a class in educational measure- 
ments in two objectively scored examinations, one of which stressed sta- 
tistical methods and the other of which stressed tests and measurements. 

In setting up a double grouping of data, a table is prepared with columns 
and rows—columns for the dispersions of Y scores within each class 
interval for the X scale, and rows for the dispersions of X scores within
each class interval for the Y scale. Along the top of the table (see 
Table 8.4) are listed the score limits for the class intervals in test X. 
Along the left-hand margin are listed the score limits for the class inter- 
vals in test Y. We make one tally mark for each individual’s X and 
Y scores. For example, if one individual had a score of 83 in test X 
and a score of 121 in test Y, we place a tally mark for him in the cell of 
the diagram at the intersection of the column for interval 80-84 in X 
and the row for interval 120-124 in Y. All other individuals are similarly 


located in their proper cells. 


TABLE 8.4.—A SCATTER DIAGRAM OF THE SCORES IN TWO ACHIEVEMENT TESTS

X: Scores in First Achievement Test (columns, class intervals of five
points, ending with 95-99)
Y: rows, class intervals of five points, from 135-139 at the top down to
95-99 at the bottom
Column frequencies (bottom row): 3, 10, 12, 26, 18, 12, 5, 1; N = 87
(Tally marks and individual cell frequencies are not reproduced here.)



When the tallying is completed, we write the number of cases, or the
cell frequency, in each of the cells. Next we sum the cell frequencies in
the rows separately, recording each frequency in the last column under
the heading $f_y$. When this column is filled, we have the total frequency
distribution for test Y. We also sum the cell frequencies in all the
columns, writing them in the bottom row with its heading $f_x$. When
completed, this row gives us the total frequency distribution for test X.
We can check the summing of the cell frequencies by adding up the last
row and last column. Their sums should, of course, both equal N, in
this case, 87. The check does not, however, guarantee correct tallying.
This can be checked partly when we correlate either test with another
one and compare total frequency distributions or when we have knowledge
of the correct frequency distribution of Y or of X from any other source.
There are times when it is wise to do the entire tallying two times and to
compare all cell frequencies in the two attempts. It is very easy to place
a tally mark in the wrong cell.
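
The tallying and the marginal check can be sketched as follows. This is an illustration under stated assumptions: class intervals are five points wide and begin at multiples of five (as in the intervals 80-84 and 120-124 named above), scores are whole numbers, and all pairs except the first (83, 121) are hypothetical values invented for the example.

```python
from collections import Counter

def class_interval(score, width=5):
    """Return the (lower, upper) score limits of the interval holding a score."""
    lower = (score // width) * width
    return (lower, lower + width - 1)

def tally(pairs, width=5):
    """Count the (X, Y) pairs that fall in each cell of the scatter diagram."""
    return Counter(
        (class_interval(x, width), class_interval(y, width)) for x, y in pairs
    )

# Only the first pair comes from the text; the rest are hypothetical.
pairs = [(83, 121), (81, 124), (77, 118), (92, 131), (68, 103)]
cells = tally(pairs)

# Marginal totals: f_x sums each column, f_y sums each row.
f_x = Counter()
f_y = Counter()
for (x_interval, y_interval), frequency in cells.items():
    f_x[x_interval] += frequency
    f_y[y_interval] += frequency

# Both marginal distributions must add up to N.
check_passes = sum(f_x.values()) == sum(f_y.values()) == len(pairs)
print(cells[((80, 84), (120, 124))], check_passes)
```

As the text warns, this check confirms the summing only; a pair tallied in the wrong cell of the right row or column would pass unnoticed.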

Computing the Pearson r from a Scatter Diagram.—When the product-moment
r is computed from a scatter diagram, the formula becomes

$$r_{xy} = \frac{\dfrac{\Sigma x'y'}{N} - c'_x c'_y}{\sigma'_x \sigma'_y} \qquad (8.5)$$

(Pearson r from grouped data)

where $x'$ and $y'$ = deviations from the guessed mean in terms of the
            class interval as the unit.
      $c'_x$ and $c'_y$ = corrections in X and Y in class-interval units.
      $\sigma'_x$ and $\sigma'_y$ = standard deviations in X and Y in terms
            of the class interval as the unit.

The details of application of this equation will now be explained and
illustrated.

Computations accompanying Table 8.5:

$$c'_x = \frac{\Sigma fx'}{N} = \frac{20}{87} = .230$$

$$c'_y = \frac{\Sigma fy'}{N} = \frac{-30}{87} = -.345$$

$$\sigma'_x = \sqrt{\frac{\Sigma fx'^2}{N} - c'^2_x} = \sqrt{\frac{206}{87} - .0529} = \sqrt{2.3149} = 1.52$$



TABLE 8.5.—SCATTER DIAGRAM FOR COMPUTING A PEARSON r

X: Examination in Statistics (columns, class intervals 60-64 through 95-99)
Y: Examination in Educational Measurements (rows, class intervals of five
points, with 135-139 at the top)
(Cell entries are not reproduced here.)


moments, and their sum, in other words, Σx’y’. It is best to begin with
the idea that every cell has its own x’y’ product and to keep that idea in
mind. In fact, it is well to determine the x’y’ product for every cell in
which individuals fall and to write it in as was done in Table 8.5.

The x’y’ product for any cell is simply the product of the x’ value times
the y’ value of that cell, close watch being kept of algebraic signs. This
matter is easily checked, of course, by making sure that the sign of every
x’y’ product is positive in the upper right quarter of the chart and also
the lower left quarter, but they are all negative in the upper left and
lower right quarters. This rule presupposes that the X measurements
are increasing from left to right and that the Y measurements are increas-
ing from below upward.

Having given every cell its x’y’ value and having recorded it in the 
upper left-hand corner of the cell, we next note how many individuals 
have that x’y’ value—in other words, the frequency in that cell. We 
multiply the cell product by the frequency, and in Table 8.5 these prod- 
ucts are recorded with algebraic sign in the lower right-hand corners of 
the cells. All that remains now is to summate them. We do this both 
in the columns and in the rows for the sake of checking, for this is an 
unusually critical number in the correlation formula, and because of the
many steps involved in deriving it there are many opportunities for errors.
The last two columns in Table 8.5 are devoted to the sums of fx’y’ values
in the rows. It is well to keep the sums of the positive products in one of
these columns and the sums of the negative products in the other. The last
two rows of the table are reserved likewise for summing the positive and
negative sums in the columns. Summing everything in the last two
columns (also in the last two rows) of the table gives us Σx’y’, and the
two estimates should check exactly. For the illustrative problem, the
positive sum is 134 and the negative is -14, leaving a net positive sum
Σx’y’ of 120. We now have everything we need for calculating r. Apply-
ing formula (8.5), we have

$$r_{xy} = \frac{\dfrac{120}{87} - (.23)(-.345)}{(1.52)(1.57)} = \frac{1.3793 + .0794}{2.3864} = \frac{1.4587}{2.3864} = .61$$

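
The arithmetic of formula (8.5) can be retraced from the summary figures quoted in the worked example: Σx’y’ = 120, Σfx’ = 20, Σfy’ = -30, Σfx’² = 206, and N = 87. The sketch below is illustrative only; the function name `grouped_r` is an assumption, and the value 1.57 for the standard deviation of Y in class-interval units is taken as reported, since its working is not shown.

```python
import math

def grouped_r(sum_xy, n, sum_fx, sum_fy, sd_x, sd_y):
    """Formula (8.5): r from a scatter diagram, all values in interval units."""
    c_x = sum_fx / n          # correction in X
    c_y = sum_fy / n          # correction in Y
    return (sum_xy / n - c_x * c_y) / (sd_x * sd_y)

N = 87
c_x = 20 / N
c_y = -30 / N
sd_x = math.sqrt(206 / N - c_x ** 2)   # from the sum of squared x' deviations
sd_y = 1.57                            # as reported; working not shown

r_grouped = grouped_r(120, N, 20, -30, sd_x, sd_y)
print(round(c_x, 3), round(c_y, 3), round(sd_x, 2), round(r_grouped, 2))
```

Carrying full precision through the intermediate steps gives the same two-place result as the hand computation with rounded values.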


INTERPRETATIONS OF A COEFFICIENT OF CORRELATION 


How High Is Any Given Coefficient of Correlation?—Any coefficient of 
correlation that is not zero and that is also statistically significant denotes 
some degree of relationship between two variables.¹ But we need further
orientation on the matter, for the strength of relationship can be regarded
from a number of points of view, and it is not correct from any one of
these points of view to say that the degree of relationship is exactly pro-
portional to r. The coefficient of correlation does not give directly any-
thing like a percentage of relationship. We cannot say that an r of .50
indicates two times the relationship that is indicated by an r of .25. Nor
can we say that an increase in correlation from r = .40 to r = .60 is
equivalent to an increase in correlation from r = .70 to .90. The coef-
ficient of correlation is an index number, not a measurement on a linear
scale of equal units.

¹ For a treatment of the topic of statistical significance of a coefficient
of correlation, see Ch. 9.

A General Verbal Description of Coefficients.—Our interpretation of
the size of r depends very much upon what we propose to do with it or
the reasons why we computed it. What would be a large correlation
for one purpose might be regarded as a small one for another. It is a
relative matter; relative to the area of investigation in which we are
working and to other factors. But taking correlations just at large,
without particular regard to their use and as a general orientation, we
may say that the strength of relationship can be described roughly as
follows for various r’s:


Less than .20     Slight; almost negligible relationship
.20-.40           Low correlation; definite but small relationship
.40-.70           Moderate correlation; substantial relationship
.70-.90           High correlation; marked relationship
.90-1.00          Very high correlation; very dependable relationship


It should be said that the coefficients should be interpreted as stated only 
when, by comparison with the standard error of r, they prove to be 
significant. It should also be said that the same interpretations apply 
alike to negative and positive r’s of the same numerical size. An r of
-.60 indicates just as close a relationship as an r of +.60.
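
The rough verbal scale can be expressed as a small lookup. This is a sketch only: the function name `describe_r` is hypothetical, and because the printed ranges share their end points (.40 closes one range and opens the next), the code assumes that a boundary value belongs to the higher category. It says nothing about statistical significance, which the text requires to be established first.

```python
def describe_r(r):
    """Rough verbal description of the size of r, ignoring its sign."""
    size = abs(r)
    if size > 1.0:
        raise ValueError("a coefficient of correlation cannot exceed 1.00")
    if size < 0.20:
        return "slight; almost negligible relationship"
    if size < 0.40:
        return "low correlation; definite but small relationship"
    if size < 0.70:
        return "moderate correlation; substantial relationship"
    if size < 0.90:
        return "high correlation; marked relationship"
    return "very high correlation; very dependable relationship"

# The coefficients met earlier in the chapter.
for r in (0.76, 0.14, -0.69):
    print(r, describe_r(r))
```

Taking the absolute value first reflects the point that negative and positive coefficients of the same numerical size are interpreted alike.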

Particular Uses Have a Bearing on Interpretation of r.—The general 
descriptive list just given should be qualified by making references to par- 
ticular uses of r. One common use is to indicate the agreement of scores 
on an aptitude test with measures of scholastic or of vocational success. 
Such a correlation is known as a validity coefficient. It is an index of the 
practical validity of a test. Chapter 18 will deal extensively with this 
subject. Common experience shows that the validity coefficient for a 
single test may be expected within the range from .00 to .60, with most 
of them in the lower half of that range. Validity coefficients for com- 
posite scores based upon combinations of several different kinds of tests 
are likely to be distinctly higher, ranging up to .80 in rare instances but 
hardly ever above the latter figure. Many who have employed tests for 
vocational guidance or vocational selection have followed a tradition 
which may be credited to C. L. Hull¹ some twenty years ago, that the
minimum validity coefficient for a test of practical usefulness is about
.45. Recent experiences have shown that this standard is too rigid and
that there are many considerations other than validity which determine
the usefulness of a test in any given situation, as will be shown in
Ch. 15. It is well recognized that a reliability coefficient, which in
very general terms is a correlation of a test with itself, is usually a
much higher figure than a validity coefficient. Following the leadership
of T. L. Kelley,² there has been a general tradition that to be
sufficiently reliable for discriminating between individuals, a test
should have a reliability coefficient of at least .94. Some have been
more liberal in this regard, allowing a minimum of .90, while others have
been more demanding, with a requirement of a minimum of .96. These
standards are rarely attainable, and it is safe to say that most tests in
use fail to meet them. As a matter of fact, there are many very useful
tests whose reliability coefficients are in the .80’s and even below. It
is coming to be recognized that validity is much more important than
reliability, and, in fact, it is possible for a test to be sufficiently
valid for practical purposes without being very reliable. Tests with
reliability coefficients as low as .35 have been found useful when
utilized in batteries with other tests.³ Such tests have been known with
validities as high as .35. They could theoretically have validities much
higher than that. Reliability and validity depend upon many
considerations that we cannot go into here. These problems will be
treated in Chs. 17 and 18. It is sufficient to say that one must be a
relativist when dealing with problems of test reliability and validity.
The student’s interpretation of a coefficient of correlation, like his
interpretation of other statistics, is subject to considerable revision
as he knows more about its uses. While these qualifications mentioned
regarding reliability and validity need to be made, the fact remains that
in practice we expect reliability coefficients to be in the upper
brackets of r values, usually .80 to .98, and validity coefficients to be
in the lower brackets, usually .00 to .80.

¹ Hull, C. L. Aptitude testing. Yonkers-on-Hudson: World, 1928. Ch. 8.
² Kelley, T. L. Interpretation of educational measurements.
Yonkers-on-Hudson: World, 1927. Pp. 210f.
³ Guilford, J. P. New standards for test evaluation. Educ. & Psychol.
Meas., 1946,

When one is investigating a purely theoretical problem, even very small
correlations, if statistically significant (undoubtedly not zero), are often
very indicative of a psychological law. Whenever a relationship between
two variables is established beyond reasonable doubt, the fact that the
correlation coefficient is small may merely mean that the measurement
situation is contaminated by many things uncontrolled or not held con-
stant. One can readily co
if all irrelevant factors ha
1.00 rather than .20, Fo



isolated. The fact that we obtain anything else is because of the inextri- 
cable interplay of variables that we cannot measure in isolation. 

The practical conclusion from this is that a correlation is always relative
to the situation under which it is obtained, and its size does not represent any
absolute natural or cosmic fact. To speak of the correlation between
intelligence and scholarship is absurd. One needs to say which intelligence,
measured under what circumstances, in what population, and to say what 
kind of scholarship, measured by what instruments, or judged by what 
standards. Always, the coefficient of correlation is purely relative to the cir- 
cumstances under which it was obtained and should be interpreted in the light 
of those circumstances; very rarely, certainly, in any absolute sense. 

How much faith one should place in any relationship shown by a 
coefficient of correlation also depends upon the urgency of the outcome. 
There are probably many medical treatments, such as some inoculations, 
vaccines, and the like, concerning which the knowledge is rather incom- 
plete, which are administered even though the correlation between the 
treatment and living (or between nontreatment and dying) is of the order 
of .10 to .20. Although the probabilities of living may be increased by 
only 1 per cent by the treatment, the saving of 1 life in 100 is regarded 
as worth the effort. If a procedure in education promised only 1 per cent 
improvement over guesswork, we should pay little attention to it, because 
the seriousness of the outcome would not justify the means. It may be 
said in passing, however, that failures to predict in vocational and edu- 
cational practice are more generally recognized by reason of correlational 
checkup than are failures to predict in medical practice, where
correlational checkup is less often made. In addition to the difference in
relative seriousness of the outcomes of prescription in the two cases, this
factor of better knowledge of goodness of results may be an important
reason for the higher standards of prescriptive accuracy demanded in
education than are sometimes required in other fields.


GRAPHIC REPRESENTATIONS OF CORRELATIONS 


In presenting the facts of correlation to the layman, who is probably
not accustomed to thinking in terms of numerical indices in any case
and who has probably never learned of the coefficient of correlation, it is
better to convey the idea of a relationship in other ways, preferably in
the form of a diagram of some kind. Figures 8.5 and 8.6 are two examples
of how this might be done. Figure 8.5 is a bar diagram showing for each
level of aptitude score, on a nine-point scale (stanine scale), the percentage
of pilot students who graduated from flying schools. The actual per-
centages are given for those who are interested in simple numbers. In
spite of the unusually large samples the percentages are given to two
significant digits only. The number of students in each stanine group is
given for those who have some appreciation of the stability offered by
large samples.

The other diagram, Fig. 8.6, shows the average rating of flying pro- 
ficiency made by cadets at each stanine level, and only the average. 
Some investigators connect successive pairs of points with lines, but in 


Fig. 8.5.—Correlation between the pilot-aptitude score (pilot stanine) and
the criterion of graduation-elimination from flying training in the AAF
illustrated by a bar diagram. Bars show, for each pilot stanine, the
percentage eliminated and the percentage graduated (scale 0 to 100), with
the number of students at each level; total 185,367. (Based upon Stanines:
Selection and classification for air crew duty. Washington, D.C.:
Headquarters, Army Air Forces, 1946.)


this particular instance the linear trend is so clear that a straight line 
has been drawn by inspection to fit the trend. It is assumed that minor 
deviations that occur are due to sampling errors. A warning should be 
given in connection with this type of figure. It can give an impression 
of degree of correlation far in excess of that justified. Not shown are 
the widths of dispersions of individuals, at different stanine levels, in this 
case. While the averages of columns do not deviate much from a straight 
line, many individual cases may deviate considerably. There are ways 
of representing average discrepancies of individuals from such a regression 




line (see Ch. 15) which could be used to give the reader some idea of their 
seriousness. 


Fig. 8.6.—Correlation between pilot-aptitude scores and instructors’
ratings of flying proficiency illustrated by means of a regression line
that is based upon the averages of ratings for different aptitude-score
levels. Horizontal axis: pilot aptitude score; vertical axis: instructors’
average ratings of flying proficiency (N = 1960).


ASSUMPTIONS UNDERLYING THE PRODUCT-MOMENT CORRELATION


The student should be warned before leaving this chapter concerning
the restrictions that should be observed in the use of the Pearson coef-
ficient of correlation. The most important requirement for the legitimate
use of the Pearson r is that the trend of relationship between Y and X be
rectilinear, in other words, a straight-line regression. This can be deter-
mined, as a rule, by inspection of the scatter diagram. If the distribu-
tion of cases within the correlation diagram appears to be elliptical,
without any indications of a decided bending of the ellipse, the chances
are that the relationship is rectilinear. Even if it is not, the deviation
from a straight-line relationship may be so slight that we may assume
rectilinearity as first approximation, and the degree of correlation indi-
cated by r will be fairly close to any index of correlation like the corre-
lation ratio (see Ch. 13) that is applied when there is curvature in the
trend. When there is an obvious bending of the distribution of cases,
a correlation ratio, or some other special coefficient, is indicated as the
best index of correlation.



There are in educational and psychological measurements certain factors 
that produce artificially curved scatters in the correlation diagram. This 
may happen when one or both distributions taken alone are badly skewed 
and the skewing is produced artificially by the faulty measuring scale, 
with its systematically shifting unit of measurement. If there is good 
reason to believe that this may be the case, one solution would be to 
normalize the skewed distribution by methods described in Ch. 12. When 
distributions are corrected for skewness, the curvature in the regression is 
frequently eliminated, and linearity is then obtained. If curvature still 
remains, then the Pearson r is not to be used to indicate the amount of 
correlation.

There is nothing in what has been said to demand that the Pearson r is
to be computed only with normal distributions. The forms of distribu- 
tions may be various, so long as they are fairly symmetrical and unimodal; 
even rectangular ones would do. The important consideration is whether 
in all columns the dispersions are approximately equal, as indicated by 
the column standard deviations, and also in all rows. This condition goes 
by the name homoscedasticity. When columns (and rows) are relatively 
homoscedastic, we may compute a Pearson r. This condition will pre-
vail generally when the two distributions are fairly symmetrical within
themselves; so we need not go so far as to compute standard deviations
of columns and rows in order to find out.¹ It is when distributions are
markedly skewed that significant departures from homoscedasticity occur. 
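
Should one wish to look at the column dispersions directly, the comparison can be sketched as below. This is a hypothetical illustration, not a procedure given in the text: the function name `column_sds`, the ten-point column width, and the nine score pairs are all invented for the example, and no particular ratio of largest to smallest standard deviation is proposed as a criterion.

```python
import math
from collections import defaultdict

def column_sds(pairs, width=10):
    """Standard deviation of Y scores within each class interval of X."""
    columns = defaultdict(list)
    for x, y in pairs:
        columns[(x // width) * width].append(y)
    sds = {}
    for lower, ys in sorted(columns.items()):
        mean = sum(ys) / len(ys)
        sds[lower] = math.sqrt(sum((v - mean) ** 2 for v in ys) / len(ys))
    return sds

# Hypothetical scores arranged so that every column has the same spread.
pairs = [
    (61, 100), (63, 104), (62, 108),
    (71, 110), (73, 114), (74, 118),
    (82, 120), (81, 124), (84, 128),
]
sds = column_sds(pairs)

# A ratio near 1 suggests roughly equal dispersions across the columns.
spread_ratio = max(sds.values()) / min(sds.values())
print(sorted(sds), round(spread_ratio, 2))
```

The same function applied with the roles of X and Y exchanged would give the dispersions within the rows.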

Figure 8.7 is presented to show graphically the kind of scatter plots 
one might expect when one or both distributions are symmetrical or 
skewed. In each diagram the form of distribution assumed is shown 
along the X or Y dimension. In diagram A both distributions are 
assumed to be normal. The probable scatter of the cases within the 
square area is elliptical. The contour of the ellipse (and of correspond- 
ing objects in the other diagrams) is not drawn so as to enclose all the 
cases but to include the central mass of them. The regression in dia-
gram A is clearly rectilinear, and homoscedasticity prevails. In diagram
B, X is normally distributed and Y is negatively skewed. The trend of
the cases is definitely curved and the distribution is not homoscedastic in
either vertical or horizontal arrays (“array” is a general term including
both rows and columns). In diagram C, with skewing in the same direc-
tion in both X and Y distributions, the regression appears to be rectilinear
but the dispersion is not homoscedastic. In diagram D, the skewing is in
opposite directions and there is neither rectilinearity nor homoscedasticity.
Only in the case of diagram A would one justifiably compute a Pearson
product-moment coefficient of correlation. In a later chapter other types
of coefficients of correlation will be described which could be applied to
the data in diagrams B, C, and D if one could justify the appropriate
assumptions that must be made.

¹ Some writers suggest that only when both distributions are normal or
nearly so will the conditions be fully satisfied for computing a Pearson r.
In practice probably no one insists upon normal distributions.

Fig. 8.7.—Hypothetical forms of scatter plots in a correlation diagram when
the forms of distribution of X and Y values differ. Diagram A shows linear
regression and homoscedasticity; B and D show curved regression and lack of
homoscedasticity; and C shows linear regression but lack of
homoscedasticity.

Exercises 


1. Using the first 10 pairs of scores in the list in Data 8A, compute a Pearson r between 
any two parts that you or your instructor selects. Use formulas 8.1 and 8.2. Find a
similar coefficient, using the last 10 pairs of scores in the same two variables. State 
your conclusions. 

2. Correlate the first 10 pairs of scores for any two other parts, using formulas 8.3
and 8.4. Correlate the same two parts, using the last 10 pairs and the same formulas. 
State your conclusions. 

3. Prepare a scatter diagram for the correlation of Parts III and IV, or any other 
two parts, including all 40 cases. Compute a Pearson r using formula (8.5). State 
conclusions.


DATA 8A.—SCORES EARNED BY FORTY HIGH-SCHOOL STUDENTS IN SEVEN PARTS OF
THE GUILFORD-ZIMMERMAN APTITUDE SURVEY*


Part I        | Part II   | Part III   | Part IV    | Part V      | Part VI       | Part VII
Verbal        | Reasoning | Numerical  | Perceptual | Spatial     | Spatial       | Mechanical
Comprehension |           | Operations | Speed      | Orientation | Visualization | Knowledge

22 il 24 29 27 39 30 
F: 3 z 40 16 23 21 

36 14 12 21 
32 8 72 32 21 20 33 
13 2 25 46 25 20 29 
24 5 30 47 2 6 8 
22 4 38 49 15 37 35 
35 1 54 53 34 28 16 
18 7 37 51 37 46 30 
13 10 61 50 38 46 35
53 23 56 45 22 41 38 
15 9 42 48 18 5 18 
34 18 30 25 40 58 46 
15 2 42 48 12 21 17 
27 4 28 28 31 26 24 
19 9 32 40 11 1 
29 4 24 37 26 3 p 
24 9 42 58 21 21 23 
27 9 54 54 23 20 30 
16 5 42 44 29 24 34
56 12 67 48 20 
22 5 58 48 28 i 
32 4 57 33 20 4 16 
18 8 49 47 19 36 42 
24 15 87 52 36 34 26 
22 12 14 j 
22 10 38 46 21 5 a 
21 21 32 33 11 43 
13 10 52 40 29 a ay 
23 3 60 49 43 13 37 

2 10 29 
20 4 50 ee ae 2l 27 
25 11 76 B 6 F- 27 
14 6 40 38 35 3 ie 
11 2 32 56 38 4 56 
2 9 61 45 

38 17 56 67 H w 20 
16 6 61 42 29 35 
14 4 17 44 %6 "3 A 
23 a2 25 61 48 23 29 16 


* Part I is a vocabulary test; Part II is composed of arithmetic reasoning problems; Part III is
composed of simple number operations; Part IV is on matching visual objects differing little;
Part V involves awareness of spatial relationships; Part VI requires imagination of [...]
in space; and Part VII is on common knowledge of tools and their use, automobile [...]tions,
and common trade knowledge. The intercorrelations in this particular sample will be found
to be generally low except between Parts I and II and between V and VI.

4. Do the same as in Exercise 3 for one or more other pairs of variables in Data 8A.
How many pairs of coefficients of correlation are possible with Data 8A? State a
general rule for the number of intercorrelations when there are n variables.

5. Compute the Pearson r for Data 8B. Interpret your findings.

6. Find five Pearson r coefficients reported in the literature. Tell what variables
were being correlated in each case. Interpret the results. Are the coefficients about
the sizes you would have expected for the things correlated? Were there any special
conditions that may have biased the amount of correlation in one way or another?


DATA 8B.—A SCATTER DIAGRAM OF REACTION-TIME MEASUREMENTS AND GRADES
EARNED IN GENERAL PSYCHOLOGY


Reaction time                      Grades in psychology
to auditory
stimulus       55-59 | 60-64 | 65-69 | 70-74 | 75-79 | 80-84 | 85-89 | 90-94 | 95-99

.180-.189      1
.170-.179      1
.160-.169      2 1 1 1
.150-.159      1 1 1
.140-.149      1 1 2 1 1 1
.130-.139      1 6 2 6 1 3
.120-.129      1 2 3 3 1
.110-.119      2 1 2
.100-.109      1


CHAPTER 9 
THE RELIABILITY AND SIGNIFICANCE OF STATISTICS 


In this chapter we raise the very important question as to how near 
the “truth” are statistical answers such as means, standard deviations, 
proportions, and the like. As was said before, any measured sample is 
usually employed to represent a larger population. A population, from 
the statistical point of view, is any arbitrarily defined group. The term 
will be more fully explained in later paragraphs. 

Our sampling has to be limited for practical reasons; we cannot measure 
total populations, or at least it is generally inefficient and unnecessary to 
do so. Yet we usually wish to generalize beyond our sample, arriving at 
scientific decisions that transcend the observations made at a particular 
time and in a particular place, or reaching administrative decisions that 
apply to larger groups of individuals. In preceding chapters we have 
been concerned with descriptive statistics only. The computed values 
were used to describe the properties of particular samples. If we want 
to apply those same descriptive statistics beyond the limits of samples, 
we must know how much risk of being wrong we take. In general terms, 
the statistics stressed in this chapter are designed to do that very thing. 
They are known as sampling statistics. 

To be more specific, when we obtain the mean of a sample that is 
measured in some respect, before we say that this obtained mean also 
describes the central tendency of the population sampled, we need to find 
some basis for believing that it does not deviate very far from the popu- 
lation mean. Fortunately, there is a statistical procedure that will inform 
us about how far our obtained mean probably deviates from the popu- 
lation mean, provided certain conditions, to be explained later, have been 
satisfied. The statistic that will do this is known as the standard error 
of the mean. In a similar manner, there are standard errors of other 


1 In some statistical writings a population mean is referred to as a true mean, and other
statistics likewise called true when reference is made to population values. It seems
better practice to steer clear of philosophical issues by avoiding reference to truth. Some


SOME PRINCIPLES OF SAMPLING


Before going into the treatment of sampling statistics, it is necessary 
to have clearly in mind the essential facts about the process of sampling. 
The application of sampling statistics depends upon certain conditions of 
sampling. If these are not satisfied, standard errors, no matter how accu- 
rately computed, may give wrong impressions. At best, they give us 
only estimates from which we can make decisions and draw conclusions, 
never with complete conviction but with various degrees of assurance. 
After making this frank concession to the limitations of sampling sta- 
tistics, it should also be asserted that without them we can hardly draw 
any generalized conclusions at all that would be of scientific or practical 
value. 

Populations and Samples.—It is time that we had a better definition of 
population. Some statisticians call it universe. In any case, the statis- 
tician’s idea of population is quite different from the popular idea. Rarely 
would any statistical study regard the entire population of a nation, a 
city, or of some geographical region as its universe. The population in a 
statistical investigation is always arbitrarily defined by naming its unique 
properties. It might be the entering freshman class in a certain uni- 
versity, or the part of the freshman class entering a certain college or 
even a certain course. It might be the male sixteen-year-olds in a given 
school district; the children of Mexican parentage in a certain city; or 
the registered Democratic voters in the New England states. All of these
examples are of groups of human individuals. Populations could, of
course, be defined as species, or phyla, or orders of animals or of plants.
There are also populations of observations or of reactions of a certain 
kind—simple reactions to sound stimuli, word-association reactions, judg- 
ments of pleasantness of colors, and the like, from the psychological labo- 
ratory. It is probably the nonhuman groups that have seemed to require 
the more general term universe as an alternative to the more restricted 
term population. In this volume we shall use the term population
to include all sets of individuals, objects, or reactions that
can be described as having a unique pattern of qualities.
We shall also find the term population used in this chapter in a
more restricted and technical sense. Two samples may be said to come
from the same population if it can be shown that they are alike in just
one respect, e.g., in IQ, in score on a certain memory test, or in
showing a like reaction to some proposal. The likeness is usually defined
statistically in terms of equal means or proportions and equal variances
or dispersions. It ordinarily takes rigorous statistical tests to satisfy the
scientific investigator that two or more samples have come from the same
population. The same population, here, then, is the same perhaps in only
one respect, ignoring differences in other respects.

writers define the population mean as the mean of a very large number of means of
samples drawn from the same population. Such a mean of means would presumably
be numerically almost identical with the actual population mean.

Parameters and Statistics.—If we were to measure all the individuals of
a population and actually to compute the indices of central tendency, 
dispersion, and correlation, as we ordinarily do for samples, we would 
obtain what the statisticians call parameters. The population parameters
exist whether we compute them or not, if we ignore the dynamic changes
that may be occurring and assume for practical purposes that these
parameter values are fixed at least for a time.

[Figure 9.1: the distribution of a population, with the population mean and standard
deviation each marked as a parameter, shown above the distribution of one random
sample, with the sample mean and standard deviation each marked as a statistic.]

Fig. 9.1.—A comparison of a population distribution and a sample distribution, also of
population parameters and sample statistics.

Figure 9.1 illustrates the distinction between population parameters
and sample statistics. The larger distribution is that of the entire population.
The smaller distribution is of a sample drawn at random from
that population. The population parameters, mean and standard deviation,
are symbolized by $\tilde{M}$ and $\tilde{\sigma}$ (M and σ, each with a tilde over it). It will be
noted that in this particular sample the mean ($M$) and the standard deviation
($\sigma$) do not coincide exactly in size with their corresponding parameters
($\tilde{M}$ and $\tilde{\sigma}$). This is characteristic. A second sample would be
expected to have still different $M$ and $\sigma$, but also similar to $\tilde{M}$ and $\tilde{\sigma}$ in
size. The same sort of parallel could be illustrated with respect to proportions
($p$ and $\tilde{p}$), semi-interquartile ranges ($Q$ and $\tilde{Q}$), and coefficients
of correlation ($r$ and $\tilde{r}$). By careful and adequate sampling we hope to
arrive at statistics that will approximate the corresponding parameters
very closely. By means of standard errors and other sampling statistics,
to be discussed later, we estimate how far our obtained statistics may
have deviated from their corresponding parameters.

Random Sampling.—It should be kept in mind that the use of sampling 
statistics (standard errors, and the like) rests on the assumption that the 
sampling has been random. The best definition of random sampling is
that it is selection of cases from the population in such a manner that 
every individual in the population has an equal chance of being chosen. 
This calls to mind a well-conducted lottery, selective-service numbers, 
coin tossing, throwing dice, and other operations which allow the “laws
of chance” to operate freely.

There are several ways of favoring random sampling from populations. 
For a population of individuals, if all members are arranged in alphabetical 
order and one wishes to draw one person in every hundred, the first case 
might be taken by blind pointing within the first hundred names and 
every hundredth one following in the list automatically chosen. Tables 
of random numbers have been published as an aid in random sampling.¹
The numbers themselves have been placed in sequence by some kind
of lottery procedure. If individuals in a population are numbered in
sequence and thus identified by number, selections can be made by following
the random numbers in any systematic way. A random sample
should be fairly representative of the population, though any particular
sample, especially a small one, may by chance not be so
representative as we would like.
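
The two procedures just described can be sketched with a pseudo-random number generator standing in for a printed table of random numbers. This is an illustration under stated assumptions, not a prescribed method: the roster of 1,000 numbered cases, the seed, and the function names are all hypothetical.

```python
import random

def simple_random_sample(population, n, rng):
    """Draw n cases without replacement; every case is equally likely to be chosen."""
    return rng.sample(population, n)

def systematic_sample(population, step, rng):
    """Pick a random start within the first `step` cases, then take every step-th case."""
    start = rng.randrange(step)
    return population[start::step]

rng = random.Random(1950)  # arbitrary seed so the sketch is repeatable
roster = [f"case_{i:04d}" for i in range(1, 1001)]  # hypothetical alphabetized list

srs = simple_random_sample(roster, 10, rng)
every_100th = systematic_sample(roster, 100, rng)
print(srs)
print(every_100th)
```

In the systematic version each individual has the same chance (1 in 100) of selection, although only 100 distinct samples are possible, since the whole sample is fixed by the starting point.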

Biased Sampling.—In a biased sample there is a systematic error.
Certain types of cases have an advantage over others in being selected. 
The likelihood of individuals being chosen differs from one to another. A 
common example of this in educational research is the voluntary return 
of questionnaires. The names of those who are to receive the question- 
naires may, to be sure, be randomly chosen from a much larger group. 
But suppose that only 60 per cent of those circularized return the questionnaires,
which is not an atypical event. The 60 per cent who do
return the data might possibly be representative, but there is a strong
presumption that in the decision to return or not to return the instrument
there is room for biasing forces to work. Those forces may or
may not be relevant to the content of the questionnaire itself. But if
the information requested implies favorable or unfavorable facts about
the respondent, his associates, or his work, it is quite natural to expect
that those with a “good” showing will be more inclined to reply than those
with a “bad” showing. If the trait of cooperativeness or of responsibility
or of dependability of the respondent is involved in the data or even correlated
with something wanted in the data, there is also a strong likelihood
of bias.

1 Examples are Tippett, L. H. C., Random sampling numbers, Cambridge: Cambridge
University Press, 1927, and Lindquist, E. F., Statistical analysis in educational research,
Boston: Houghton, 1940, Table 18.

A colossal example of biased sampling is that of the Literary Digest 
public-opinion poll during the 1936 presidential campaign. Several mil- 
lion post-card ballots were said to have been circulated, certainly antici- 
pating a sample of most generous size. But the mailing lists were made 
up from telephone directories and automobile registration lists. It so 
happened that in the poll the telephone subscribers and car owners voted 
with a majority in favor of the candidate who lost, while the nontelephone 
subscribers and noncar owners voted at the polls in a more decisive way 
for the successful candidate. Among those who received post-card ballots 
there was also probably a selection as to which ones would be most likely 
to take the trouble to return the card. Those who were most discon- 
tented with things as they were and wanted a change most would take 
the trouble to register a protest straw vote. Those who were contented 
or who felt somewhat secure as to the outcome would be less likely to 
return the card. This would also tend to make the vote appear to favor 
the losing candidate, who was running against an incumbent. 

The scientific investigator must be eternally vigilant to the possibility 
of biased sampling. A good, systematic control of experimental con- 
ditions is designed to prevent biased samples or to make known their 
effects. Where there is less than customary experimental control of the 
observations, every possible effort should be made to know the conditions 
under which the data are obtained. Thorough knowledge of the con- 
ditions should be a basis for deciding whether selection of cases has been 
biased. Knowledge of conditions is also essential for the sake of accurate 
definition of the population sampled. 

Stratification in Sampling.—One common procedure that is introduced
in sampling to help to prevent biases and also to assure a more representative
sample is known as stratification. Stratification is a step in
the direction of experimental control. It operates with subgroups of more
homogeneous composition within the larger population.

A very common example is to be found in public-opinion polling 
practices. Suppose the issue to be investigated is public attitude toward 
a certain piece of labor legislation. It is quite likely that people in the
two major political parties would tend to lean in opposite directions on
such an issue. It is probable that people of different socioeconomic 
categories—professional, business, office worker, semiskilled laborer, and 
unskilled laborer—would react with some systematic differences on the 
issue. It is possible, though not so likely, that individuals of the two 
sexes would tend to respond somewhat differently. Other divisions of 
the population, such as rural versus urban, regional, and educational 
groups, might also show systematic differences on the issue. In other 
words, subgroups of the population are considered with respect to any 
variable that is suspected of correlating appreciably with the variable 
being studied. It does not matter that some of the variables are them- 
selves intercorrelated unless such an intercorrelation is very high, in 
which case it would be superfluous to control selection of samples on 
both of two variables so closely related. 

Having decided which variables are important in sampling, the entire 
population is studied to see what proportions fall into each category, i.e.,
what proportions are Democrat or Republican; male or female; urban 
or rural; in each socioeconomic group; and so on. Any sample to be 
obtained, then, should have proportional representations from all sub- 
groups. Within each defined subpopulation, e.g., a male, professional, 
Republican, New England group, random sampling may then be carried 
out. Random selection of cases would also be made within each of the 
other defined subpopulations in appropriate numbers. The total sam- 
pling procedure here described has been called stratified-random sampling. 

The importance of the proportional-representation principle and its 
advantage over a purely random sampling can be readily demonstrated. 
Suppose that 55 per cent of the Republicans and 45 per cent of the 
Democrats are in favor of a certain labor bill. In the general population 
let us assume that 60 per cent are registered Democrats and 40 per cent 
are registered Republicans. In a random sample of 100 voters one would
expect in the long run to draw the two party representatives in about the 
same ratio, 60/40. This would vary from sample to sample, however, 
even to the extent that the majority could be reversed; for example, it 
could even be 45/55. In the typical polling sample we would expect a
majority of voters against the bill. If the sample should by chance contain
a majority of Republicans, however, the majority might favor the
bill. If stratification were applied, we would be sure to have in the 
sample the ratio 60/40, and with this restriction imposed upon the ran- 
dom sampling we should expect the general population sentiment to be 
more accurately reflected. Thus it can be seen that a stratified-random 
sample is likely to be more representative of a total population than is a
purely random sample.
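
The arithmetic behind the labor-bill example can be worked out directly. The sketch below uses the percentages given in the text; the variable names are hypothetical, and it assumes a population large enough that each draw can be treated as independent.

```python
import math

# Figures taken from the example in the text.
share = {"Democrat": 0.60, "Republican": 0.40}  # party shares in the population
favor = {"Democrat": 0.45, "Republican": 0.55}  # proportion of each party favoring the bill
n = 100                                         # sample size

# Proportion of the whole population favoring the bill.
overall = sum(share[p] * favor[p] for p in share)

# Simple random sample: each voter drawn favors the bill with probability `overall`.
sd_simple = math.sqrt(overall * (1 - overall) / n)

# Proportional stratified sample: n * share[p] voters drawn at random within each party.
sd_stratified = math.sqrt(sum(share[p] * favor[p] * (1 - favor[p]) for p in share) / n)

print(round(overall, 2), round(sd_simple, 4), round(sd_stratified, 4))
```

Under these assumptions 49 per cent of the population favors the bill, and the sample proportion fluctuates slightly less under stratification. The gain is small here because the two parties differ by only ten percentage points; it would be larger for a variable on which the strata differ more sharply.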


Purposive Samples.—A purposive sample is one arbitrarily selected
because there is good evidence that it is very representative of the total 
population. Experience has shown in public-opinion polling that there 
are certain states or regions that come close to national opinion time after 
time. If one is willing to depend upon this experience, one may use the 
limited population as the source of the sample to use as a “barometer” 
for the total population. This is a convenient procedure, but it has the 
disadvantage that much prior information must have been obtained. 
There is also a risk that conditions may change to the extent that the 
particular segment of population no longer represents the total or does 
not represent it on some new issue. 

Incidental Samples.—The term incidental sample is applied to those
samples that are taken because they are the most available.¹ Many a
study has been made in psychology with students in classes of beginning 
psychology as the samples merely because they are most convenient. 
Results thus obtained can be generalized beyond such groups only with considerable
risk. Generalizations beyond any sample can be made safely
only when we have defined the population that the sample represents in 
every significant detail. If we know the significant properties of the 
incidental sample well enough and can show that those properties apply 
to new individuals, those new individuals may be said to belong to the 
same population as the members of the sample. By “significant proper- 
ties” is meant those variables that correlate with the experimental vari- 
ables involved. They are the kind of properties considered above in 
connection with stratification of samples. It is unlikely that member- 
ship in political party would have much bearing upon the results of cer- 
tain experiments performed upon sophomores in a beginning psychology 
course, but such variables as age, education, social background, and the 
like may definitely be pertinent. Much depends upon the experimental 
variable under study; whether it is a motor skill or a social attitude, a
suggestible reaction or an interest-test score. If incidental samples are
employed, the investigator is under scientific obligation to describe the
properties of his group in all aspects that he can conceive as being related
to the outcome of the investigation.


1 Such a sample is often called “accidental.” In no real sense is the sample an accident;
it was selected. It would be an “accident,” of course, if the sample represented
usefully a population in which we want to make predictions of parameters.


THE RELIABILITY OF AVERAGES


[...] standard deviation ($\tilde{\sigma}$) is 10.0 on the measuring scale we are using. Such a distribution is
illustrated by the top diagram in Fig. 9.2. We do not know these popu- 
lation parameters ordinarily, but for the sake of an illustration we will 
assume that we do know them here. 

Sampling Distributions.—Suppose, next, that we proceed to draw ran- 
dom samples, all of equal size, one at a time, from this population. To 
satisfy the conditions of random sampling in a strictly mathematical 
sense, we should replace each sample drawn, after noting the value of 
each of its members, before drawing the next sample. Each individual 
should have an equal opportunity of being selected in every sample. 
Having lost one sample, the population is different from what it was 
originally. When the population is very large, as compared with the size 
of sample, however, we can forget about this replacement requirement for 
practical purposes. In this case, one sample would “hardly be missed”; 
that is, its loss would change the chance conditions to an inconsequential 
degree. We will find, later, that when the size of sample is not decidedly 
smaller than the population, it is possible to make allowance for this fact. 

To take a specific example of random sampling, with the same popu- 
lation described above in mind, let the size of sample be 25. The sample 
mean will not only differ from sample to sample, but will also usually
deviate from the population parameter (in this example, the mean of 50.0). 
If we have a number of such sample means, we may treat them just as if 
each were a single observation and set up a frequency distribution of them. 
This is known as a sampling distribution. Such a frequency distribution 
will be close to the normal form. Normality of distribution of single cases 
in the total population favors normal distribution of means and of other 
statistics computed from samples drawn from that population. Even
when the population distribution departs from normality, however, the
distribution of means of samples drawn from it tends to be normal,
unless the samples are too small. The smaller the sample, the more does the
form of distribution of the population affect the form of distribution of
the means. The extreme case would be samples of only one case each,
in which event we should expect the distribution of means (if means of
one observation each have any real meaning) to be of the same form as
that of the population.
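
A sampling distribution of this kind can be imitated by drawing many samples by computer. The following is a rough sketch: it uses the population mean of 50.0, standard deviation of 10.0, and sample size of 25 from the example, while the number of samples, the seed, and the choice of a rectangular population for comparison are assumptions made for illustration.

```python
import random
from statistics import mean, pstdev

rng = random.Random(9)             # arbitrary seed for repeatability
POP_MEAN, POP_SD, N = 50.0, 10.0, 25
NUM_SAMPLES = 2000                 # hypothetical number of repeated samples

def sample_mean(draw):
    """Mean of one random sample of N cases produced by draw()."""
    return mean(draw() for _ in range(N))

# Normal population with the parameters assumed in the text.
normal_means = [sample_mean(lambda: rng.gauss(POP_MEAN, POP_SD))
                for _ in range(NUM_SAMPLES)]

# Rectangular (uniform) population with the same mean and SD:
# a uniform distribution on [a, b] has SD (b - a) / sqrt(12).
half_width = POP_SD * 12 ** 0.5 / 2
flat_means = [sample_mean(lambda: rng.uniform(POP_MEAN - half_width,
                                              POP_MEAN + half_width))
              for _ in range(NUM_SAMPLES)]

for label, means in (("normal", normal_means), ("rectangular", flat_means)):
    print(label, round(mean(means), 2), round(pstdev(means), 2))
```

In both cases the means should center near 50 with a dispersion near 2, far narrower than the population's 10, and a histogram of either list should look roughly bell-shaped even though the second population is flat.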

A knowledge of the form of sampling distribution of a statistic is very 
important. Our ability to draw conclusions known technically as statisti- 
cal inferences depends upon knowing the form of distribution of sample 
statistics. Without knowledge of the form of sampling distribution, many 
a scientific result would remain inconclusive. The reasons for this will be 
clearer as we go into the subject of interpretation of standard errors. 


The Standard Error of a Mean.—At this stage of getting acquainted
with sampling distributions, we are most interested in the dispersion of 
statistics, in this case, the dispersion of sample means. The reason is 
that the amount of this dispersion gives us the clue as to how far such 
sample means may be expected to depart from the population mean. If 
we are to use a sample mean as an estimate of the population mean, any 
deviation of such a sample mean from the population mean may be 
regarded as an error of estimation. The standard error of a mean tells 
us how large these errors of estimation are in any particular sampling 
situation. The standard error of a mean is a standard deviation of the 
distribution of sample means. To distinguish such a standard deviation 
from the more familiar one that applies to dispersions of individual obser- 
vations, we call it a standard error. In later discussions it may be referred 
to by use of the abbreviation SE. 
In order actually to compute the standard error of a mean, we need
two items of information: the population parameter $\tilde{\sigma}$, and the size of
sample $N$. Since we do not ordinarily know $\tilde{\sigma}$, it would seem that we
could but rarely, indeed very rarely, compute this standard error. There
are satisfactory ways of estimating it, however, as we shall see later. The
formula for computing the standard error of a mean is

$$\tilde{\sigma}_M = \frac{\tilde{\sigma}}{\sqrt{N}} \qquad \text{(Standard error of an arithmetic mean computed from a known population parameter)} \tag{9.1}$$

where $\tilde{\sigma}$ = standard deviation of the population.
$N$ = number of cases in the sample (not the number of means in
the distribution of means).

Sample Size and the Standard Error of a Mean.—The standard error of
the mean is therefore directly proportional to the standard deviation of the
population and inversely proportional to the size of the sample. More
precisely stated, $\tilde{\sigma}_M$ is inversely proportional to the square root of the size
of sample. As the individuals of a population scatter more widely, so
will the means of samples drawn from that population also scatter more
widely. But as we include more individuals in each sample drawn, the
less widely can the means scatter from their central tendency. In the
limiting case, if the sample includes the entire population, the deviation
of the sample mean from the population mean can then be only zero,
and $\tilde{\sigma}_M$ is zero. In Fig. 9.2 are shown graphically several instances of
samples when $N$ varies. The smallest possible sample occurs when $N = 1$.
The mean of each sample is then identical with the individual’s measurement
in that sample. The dispersion of such means is as great as the
dispersion of the total population; $\tilde{\sigma}_M$ then equals $\tilde{\sigma}$, which we have
assumed to equal 10. When each sample contains two cases,

$$\tilde{\sigma}_M = \frac{10}{\sqrt{2}} = 7.07$$

When each sample contains four cases, $\tilde{\sigma}_M = 10/\sqrt{4} = 5$; etc. The
remaining cases in Fig. 9.2 should now speak for themselves.
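
The remaining values follow from formula (9.1) in the same way. A short sketch, in which the function name is hypothetical and the sample sizes are those listed for Fig. 9.2:

```python
import math

def standard_error_of_mean(pop_sd, n):
    """Formula (9.1): population SD divided by the square root of the sample size."""
    return pop_sd / math.sqrt(n)

POP_SD = 10.0  # the population standard deviation assumed in the text
for n in (1, 2, 3, 4, 16, 25):
    print(n, round(standard_error_of_mean(POP_SD, n), 2))
```

Quadrupling the sample size halves the standard error, which is why the gain from each added case shrinks as the sample grows.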


[Figure 9.2: a column of distributions drawn on a common scale. From top to bottom:
distribution of individual measures for a whole population ($\tilde{\sigma} = 10$); distribution of
means for samples of one case each ($\tilde{\sigma}_M = 10$); and distributions of means for samples of
two, three, four, 16, and 25 cases each.]

Fig. 9.2.—Showing the hypothetical decrease in variability or fluctuation of the means of
samples as we increase the size of the sample drawn at random from a large population.
(Modified from Lindquist, A first course in statistics. Houghton Mifflin, by permission.)

Estimating the Standard Error of a Mean from σ.—Formula (9.1) requires
our knowing the parameter $\tilde{\sigma}$ in order to compute the standard error of a
mean. In ordinary practice we must be satisfied with an estimate of this
standard error. Ordinarily, we have only one sample and its standard
deviation must be utilized as a basis for estimating $\tilde{\sigma}$, and hence for estimating
$\tilde{\sigma}_M$. When the sample $\sigma$ is known, the formula generally used reads

$$\sigma_M = \frac{\sigma}{\sqrt{N - 1}} \qquad \text{(Standard error of a mean estimated from } \sigma\text{)} \tag{9.2}$$

where $\sigma_M$ = what is ordinarily called the standard error of a mean (as
estimated from $\sigma$).
$\sigma$ = standard deviation of the sample.
$N$ = number of cases in the sample.

Strictly speaking, we might well have used the symbol $s_M$ in place of $\sigma_M$,
to tell the story of its being an estimate of $\tilde{\sigma}_M$. The symbol $\tilde{\sigma}_M$ corresponds
to $\sigma_M$ in much the same way that $\tilde{\sigma}$ corresponds to $\sigma$, the one
being a parameter and the other an estimate of it given by a sample.
That is, $\sigma$ may be regarded as an estimate of $\tilde{\sigma}$. We shall see that it is a biased
estimate, but nevertheless an estimate, and we can allow in part for the bias.
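
Formula (9.2) can be applied to a single sample as follows. This is a sketch with invented scores; it also checks that dividing the sample standard deviation (computed with N) by the square root of N − 1 gives the same result as dividing the N − 1 estimate of formula (9.3) by the square root of N.

```python
import math
from statistics import pstdev, stdev

# Hypothetical sample of ten scores, invented for illustration.
scores = [48, 52, 61, 39, 55, 47, 50, 58, 44, 46]
n = len(scores)

sigma = pstdev(scores)             # sample SD with N in the denominator
se_92 = sigma / math.sqrt(n - 1)   # formula (9.2)

s = stdev(scores)                  # estimate with N - 1 in the denominator
se_alt = s / math.sqrt(n)          # algebraically the same quantity

print(round(sigma, 3), round(s, 3), round(se_92, 3), round(se_alt, 3))
```

The two routes agree because the N − 1 correction is applied exactly once in each: to the standard error directly in the first, and to the standard deviation in the second.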

Estimating $\tilde{\sigma}$ from a Sample.—The standard deviation in a sample is
likely to be smaller than that for the population from which the sample
came. Recall from the discussion in an earlier chapter (Ch. 5) that as
samples become small the total range of measures is more and more curtailed.
This comes about from the fact that extreme deviations in the
population are rare and in small samples are likely to be missed. This
fact has an effect also upon the size of the standard deviation, though
the latter effect is much less drastic than the effect on the range. In
small samples, particularly those with $N$ less than 30, the sample $\sigma$ gives
an estimate of the population $\tilde{\sigma}$ that is biased downward. If we want an
unbiased estimate of $\tilde{\sigma}$ directly from the sums of squares of a sample, we
can use the formula

$$s = \sqrt{\frac{\Sigma x^2}{N - 1}} \qquad \text{(Unbiased estimate of population standard deviation)} \tag{9.3}$$

where $\Sigma x^2$ = sum of squares in the sample.
$N$ = number of cases in the sample.
N = number of cases in the sample. 
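
As a minimal sketch of how formulas (9.2) and (9.3) might be applied, the Python below computes the sample σ (divisor N), the estimate of the population σ̄ (divisor N − 1), and the standard error of the mean. The function names and the five scores are illustrative assumptions, not part of the text's notation.

```python
import math

def sum_of_squares(scores):
    """Sum of squared deviations from the sample mean (the text's sum of x squared)."""
    mean = sum(scores) / len(scores)
    return sum((x - mean) ** 2 for x in scores)

def sample_sd(scores):
    """Standard deviation of the sample itself: divisor N."""
    return math.sqrt(sum_of_squares(scores) / len(scores))

def population_sd_estimate(scores):
    """Estimate of the population standard deviation, formula (9.3): divisor N - 1."""
    return math.sqrt(sum_of_squares(scores) / (len(scores) - 1))

def standard_error_of_mean(scores):
    """Standard error of the mean from the sample sigma, formula (9.2)."""
    return sample_sd(scores) / math.sqrt(len(scores) - 1)

scores = [5, 7, 10, 12, 16]  # illustrative data only
print(sample_sd(scores))
print(population_sd_estimate(scores))
print(standard_error_of_mean(scores))
```

With so few cases the two standard deviations differ noticeably; with N in the hundreds the difference would all but vanish.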
Degrees of Freedom.—Formula (9.3) contains an important new concept that will be found liberally utilized hereafter when sampling errors (deviations) are mentioned in connection with small samples. Compare formula (9.3) with the basic one for the standard deviation of a sample (formula 5.5), and it will be found that they are identical except for the denominators, which are (N − 1) and N, respectively. The difference between the two may seem very slight (and it is slight numerically when N is reasonably large), but there is a very important difference in meaning. In this particular formula, (N − 1) is known as the number of degrees of freedom, which is symbolized by df. This has been a key concept in recent years in what is known as small-sample statistics. The number of degrees of freedom will not always be (N − 1) but will vary from one statistic to another, as will be pointed out in various places later. Let us see why the number is (N − 1) here.

The “freedom” part of the concept means freedom to vary. The stand- 
ard deviation is computed from the variance, and the variance is com- 
puted from deviations from the mean. Statisticians often express the 
matter by saying that one degree of freedom is “used up” when we com-
pute the mean of a sample. This leaves (N − 1) degrees of freedom for
estimating the population variance and the standard deviation. 

A numerical example will make this clearer. Let us assume five 
measurements: 5, 7, 10, 12, and 16, the mean of which is 10.0. A mathe- 
matical requirement or property of the arithmetic mean is that the sum 
of the deviations from it equals zero. The five deviations in this sample 
are —5, —3, 0, +2, and +6, the sum of which is zero. With this con- 
dition satisfied, i.e., the sum equal to zero, how many of these deviations 
could be simultaneously altered (as if by taking new samplings) and still 
leave the sum equal to zero? With a little thought or trial and error it 
will be seen that if any four are arbitrarily changed, the fifth is thereby 
fixed. We could make the first four —8, —4, +1, and —2, which would 
mean that for the sum to equal zero the fifth has to be +13. Try any 
other changes and if the sum is to remain zero one of the five deviations is 
automatically determined. Thus only four (N − 1) are “free to vary”
within the restriction imposed. The restriction is that the mean is taken 
as fixed for the sample. In this sense, the computation of the mean 
“uses up” one degree of freedom. There were N degrees of freedom in 
computing the mean because the cases were presumably sampled entirely 
independently. If they were not independently sampled then there were 
also fewer than N degrees of freedom in computing the mean. We shall
see examples of this later. Freedom means independence and only when 
there is independence of observations can the “laws of chance” operate 
freely and the mathematics based upon the “laws of chance” be applied.¹
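
The restriction in the numerical example can be put in a few lines of Python. This is only a sketch, and the function name is our own: once any N − 1 deviations are chosen, the last one is whatever makes the sum zero.

```python
def last_deviation(free_deviations):
    """Given N - 1 freely chosen deviations, return the one remaining deviation
    forced by the requirement that all N deviations from the mean sum to zero."""
    return -sum(free_deviations)

original_four = [-5, -3, 0, 2]   # first four deviations of the sample 5, 7, 10, 12, 16
altered_four = [-8, -4, 1, -2]   # the arbitrary alteration used in the text

print(last_deviation(original_four))
print(last_deviation(altered_four))
```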

¹ For an excellent discussion of the general subject of degrees of freedom, see Walker, H. M., Degrees of freedom. J. educ. Psychol., 1940, 31, 253-260.

The Sample Mean as an Estimate of the Population Mean.—Since the sample σ is a biased estimate of the population σ̄, one might expect to hear that the sample M is also a biased estimate of the population M̄. On the other hand, from the discussion in the preceding paragraph, in which it was pointed out that no loss of degrees of freedom affected the computation of the mean, one might conclude that there is no bias in using the sample mean as an estimate of the population mean. It is true that while the sample σ systematically (i.e., in the long run) underestimates the population σ̄, the sample M is an unbiased estimate of the population M̄. It does not coincide with the population mean, except by chance, but it overestimates M̄ as often as it underestimates it.

Other Relations of σ and σ̄.—While formula (9.2) is the practical one to use for estimating the standard error of the mean, there are other relationships that may add meaning to this discussion, if they do not also contribute formulas that are of practical utility from time to time.

If we have already computed the sample σ and have lost the data on sums of squares, we may still estimate σ̄, if we care to do so, from the formula

$$\bar{\sigma} = \sigma \sqrt{\frac{N}{N - 1}} \tag{9.4}$$
(Population σ̄ estimated from sample σ)

where the symbols are as defined previously. If we divide formula (9.4) through by σ, we have

$$\frac{\bar{\sigma}}{\sigma} = \sqrt{\frac{N}{N - 1}} \tag{9.5}$$
(Ratio of population σ̄ to sample σ)

And if we square both sides of this equation, we have

$$\frac{\bar{\sigma}^2}{\sigma^2} = \frac{N}{N - 1} \tag{9.6}$$
(Ratio of population variance to sample variance)

In other words, the ratio of the population variance to the sample variance is the ratio of N to (N − 1). The larger N becomes, the more closely (N − 1) approaches N and consequently the more closely σ approaches σ̄.

If we are not interested in knowing the size of σ̄ or even of σ, we may estimate σ̄_M directly from the sum of squares by the formula

$$\sigma_M = \sqrt{\frac{\Sigma x^2}{N(N - 1)}} \tag{9.7}$$
(Standard error of a mean directly from sum of squares)

in which the symbols are as previously defined.
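
The ratio N/(N − 1) and the unbiasedness of the sample mean can be checked by brute force. The sketch below treats five scores as if they were an entire population (an assumption made purely for illustration) and enumerates every ordered sample of two cases drawn with replacement.

```python
from itertools import product

population = [5, 7, 10, 12, 16]   # treated here as a complete population (illustrative)
n = 2                              # number of cases in each sample

def mean(values):
    return sum(values) / len(values)

def variance(values):
    """Variance with divisor N, i.e., the square of the sample sigma."""
    m = mean(values)
    return sum((v - m) ** 2 for v in values) / len(values)

# All 25 ordered samples of size 2, sampling with replacement
samples = list(product(population, repeat=n))

average_sample_mean = mean([mean(s) for s in samples])
average_sample_variance = mean([variance(s) for s in samples])

print(average_sample_mean, mean(population))
print(variance(population) / average_sample_variance, n / (n - 1))
```

In the long run the sample means center on the population mean, while the sample variances fall short of the population variance by the ratio (N − 1)/N.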

Interpretation of the Standard Error of a Mean.—We are now ready to apply the standard-error formula to a concrete instance. To revive an old illustration, the ink-blot data, we find that σ is 10.45 and N is 50. Applying formula (9.2), σ_M = 10.45/√49 = 10.45/7 = 1.49. The standard error of the mean of the ink-blot scores is 1.49, or 1.5. What we are asking when we compute this standard error is how far from the population mean sample means like the one we obtained would vary. We do not know what the population mean is, but from the value 1.5 we conclude that means of samples of 50 cases each would not deviate from it in either direction more than 1.5 units about two-thirds of the time. The interpretation of a standard error of a mean is in the latter respect like that of a standard deviation of a sample. The range from −1σ to +1σ in both cases includes about two-thirds of the cases when the distribution is normal. Here we see the definite advantage of being able to assume a normal form of distribution for the means. In the sample, however, we know the value of the mean about which the cases vary, whereas in the distribution of means we do not know the value of the population mean about which those means vary.

What is the good, then, of knowing the standard error of a mean? 
There are several answers to this question. If we know that two-thirds 
of all sample means probably do not deviate more than 1.5 units from 
the population mean, we know that our obtained mean is probably not 
more than 1.5 units distant from the population mean. Since two-thirds
of such sample means are probably not over 1.5 units from the population 
mean and one-third of them are probably more than 1.5 units from it, 
we can also say that the odds are 2 to 1 that the obtained mean does not 
differ from the population mean by more than 1.5 units. We have thus 
bracketed the estimate of the population mean to this extent. The
smaller the σ_M, the narrower is this bracketing and the greater confidence
we have in an obtained mean as an estimate of the population mean.
Thus our degree of confidence in a statistic is related clearly to the size
of its SE. The larger the SE, the less confidence we have in the statistic.
The statistic still describes the particular sample, however, even when the
SE is relatively large. But our confidence in generalizing from it depends
upon its SE.

The odds of 2 to 1 are not regarded as heavy odds in statistics, though they may be so considered in gambling. This standard of confidence is usually regarded by statisticians as being entirely too low. We ordinarily want a great deal more assurance concerning a statistical result. If we allow wider margins, let us say of 2σ either way in the normal distribution, we have approximately 95 per cent of the sample means included, and approximately 5 per cent (2.5 per cent in each tail of the curve) beyond those limits.¹ In the ink-blot data the deviations at 2σ are 2.98 units, or approximately 3.0 units, distant from the population mean. We could say that there are only 5 chances in a hundred that a sample mean (when N is 50) will deviate more than 3 units (in either direction) from the population mean. We could also say that the odds are about 19 to 1 that a sample mean will not be so far as 3 units distant from the population mean.

¹ Strictly speaking, the z distance from the mean that includes the middle 95 per cent of the cases in a normal distribution is ±1.96σ. In this interpretation of sampling statistics we can afford to be so rough as to ignore .04 of a σ unless some very refined decision is at stake.
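
For readers who wish to check the odds quoted above, the following sketch computes the ink-blot standard error by formula (9.2) and the proportion of a normal distribution lying within a given number of standard errors of the mean. The helper name `proportion_within` is our own; the normal-curve area is obtained from the error function in Python's standard library.

```python
import math

def proportion_within(z):
    """Proportion of a normal distribution within plus or minus z standard deviations."""
    return math.erf(z / math.sqrt(2))

sigma, n = 10.45, 50                  # ink-blot data from the text
se_mean = sigma / math.sqrt(n - 1)    # formula (9.2)

for z in (1.0, 1.96, 2.0):
    inside = proportion_within(z)
    odds = inside / (1 - inside)
    print(f"within {z} SE ({z * se_mean:.2f} units): "
          f"proportion {inside:.3f}, odds about {odds:.1f} to 1")
```

The "two-thirds" and "95 per cent" of the text are thus convenient roundings of the exact normal-curve areas.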

Let us apply the interpretation of σ_M to some other data. The practical
usefulness of a statistic is often more apparent when comparing the same 
statistic derived from different data. In Table 9.1 are given means of 
Army General Classification Test scores for samples derived from differ- 
ent civilian occupational groups. For the sake of an illustration, we will 
assume that each occupational group represents a different population, 
as designated, and that the sampling of scores was random. What do 
the standard errors in this table tell us? 


TABLE 9.1.—COMPARISON OF MEANS OF SCORES ON THE ARMY GENERAL CLASSIFICATION
TEST AS APPLIED TO MEN FROM DIFFERENT CIVILIAN OCCUPATIONAL CATEGORIES*

Occupation       N              M        σ       σ_M
Accountant       172            128.1    11.7    0.88
Lawyer           94             127.6    10.9    1.13
Reporter         45             124.5    11.7    1.76
Sales clerk      492            109.2    16.3    0.74
Plumber          128            102.7    16.0    1.42
Truck driver     817            96.2     19.7    0.69
Farm hand        [illegible]    91.4     20.7    0.72
Teamster         77             87.7     19.6    2.23

* From Harrell, T. W., and Harrell, M. E. Army General Classification Test scores for civilian
occupations, Educ. & Psychol. Meas., 1945, 6, 229-240. By permission of the publisher.


The mean in which we would have the greatest confidence, as repre- 
senting the status of the general occupational population, is that for the 
truck driver. The odds are about 2 to 1 that this sample mean of 96.2 
does not deviate more than .7 from the mean of all truck drivers that this 
sample represents. We could be practically certain (allowing a margin
of ±3σ) that the obtained mean for truck drivers is not over 2 units
distant from that of all truck drivers of this kind. The mean in which
we have least confidence is that for teamsters by reason of its σ_M of 2.23.

Incidentally, the relation of σ_M to both σ and N can be seen roughly by comparison of the data for the occupational groups. On the whole, the largest standard errors come for samples where N is smallest (for lawyer, reporter, plumber, and teamster), though the rank orders are not perfect within this list of four. Where sample sizes are comparable, as for lawyer and teamster, and for accountant and plumber, the value for σ_M is more apparently in proportion to the standard deviation of the sample. It can be seen that if the sample is large enough, the margin of error in an obtained mean (the margin being measured by the standard error) can be brought below one scale unit.
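
The dependence of the standard error on both σ and N can be retraced by recomputing it with formula (9.2) for those rows of Table 9.1 whose N and σ are legible in this copy. The sketch below is illustrative only; agreement with the published values is close but not exact in every row, as may be expected when the inputs are themselves rounded.

```python
import math

# (occupation, N, sigma, published standard error) from Table 9.1;
# rows with an illegible N or sigma are omitted
rows = [
    ("Accountant", 172, 11.7, 0.88),
    ("Lawyer", 94, 10.9, 1.13),
    ("Sales clerk", 492, 16.3, 0.74),
    ("Plumber", 128, 16.0, 1.42),
    ("Truck driver", 817, 19.7, 0.69),
    ("Teamster", 77, 19.6, 2.23),
]

def standard_error_of_mean(sigma, n):
    """Formula (9.2): sample sigma divided by the square root of N - 1."""
    return sigma / math.sqrt(n - 1)

recomputed = {name: standard_error_of_mean(sigma, n) for name, n, sigma, _ in rows}

for name, n, sigma, published in rows:
    print(f"{name:<12} N={n:<4} sigma={sigma:<5} "
          f"published={published:.2f} recomputed={recomputed[name]:.2f}")
```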

The Accuracy of Published Means.—The last paragraphs illustrate the 
point that any obtained mean really stands for a region rather than for a 
point, when it is used to estimate a population mean. This suggests that 
when a mean is so used we pay some attention to its standard error when 
deciding how many decimal places to report. If the standard error is 
greater than one scale unit, is there much use in reporting the mean to 
one or two decimal places? In this connection, the author recalls the 
mathematician who chided a graduate student for reporting coefficients 
of correlation to four decimal places “when the last three digits are 
probably wrong.” 

Statisticians are not agreed upon the rules governing the number of 
places to report when the standard error is considered. Kelley has pro- 
posed the rule that a published statistic should be terminated with “the 
decimal place given by the first figure of one-third of its standard error.”¹
By “first figure,” Kelley evidently means “first significant (nonzero) 
figure.” Let us apply this rule to the means in Table 9.1. One-third of 
the σ_M for accountants is .29. The first significant figure is in the first
decimal place, hence the mean may be reported to one decimal place. 
Even for the teamsters, where one-third of σ_M is .74, the first digit does
not go into the unit column and hence we may report the mean to one 
decimal place. 


As Kelley points out, there are 74 chances in 100 that a deviation as large as one-third of a standard error can occur by random sampling. Thus, errors greater than this standard are more likely than not to occur. It is the author's view that one might better require a limit of two-thirds of σ_M, beyond which about 50 per cent of the sample means would be expected. With this standard, the mean for teamsters in Table 9.1 would be reported to the nearest whole number: 88 rather than 87.7.
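
Both rules can be stated compactly in code. The sketch below is one possible reading of them, with function names of our own choosing: it locates the first significant figure of a chosen fraction of the standard error and reports how many decimal places that allows, and it gives the normal-curve chance of a deviation exceeding that fraction.

```python
import math

def decimals_to_report(standard_error, fraction=1 / 3):
    """Decimal places allowed by the position of the first significant (nonzero)
    figure of `fraction` times the standard error. Returns 0 when that figure
    falls in the units column or higher."""
    limit = fraction * standard_error
    position = math.floor(math.log10(limit))   # -1 for tenths, 0 for units, and so on
    return max(0, -position)

def chance_of_exceeding(z):
    """Two-tailed normal probability of a deviation larger than z standard errors."""
    return 1 - math.erf(z / math.sqrt(2))

print(decimals_to_report(0.88))           # accountants, Kelley's one-third rule
print(decimals_to_report(2.23))           # teamsters, Kelley's one-third rule
print(decimals_to_report(2.23, 2 / 3))    # teamsters, the two-thirds limit
print(round(chance_of_exceeding(1 / 3), 2), round(chance_of_exceeding(2 / 3), 2))
```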

¹ Kelley, T. L. Fundamentals of statistics. Cambridge: Harvard University Press, 1947. P. 223.

It should be remembered that this discussion applies only to sample means used to describe populations. The mean used to describe any particular sample might well be reported to one digit beyond the last in the observed measurements, as was recommended in Ch. 4. Even when taken to represent population values, means may justifiably be reported to more places than the rules just mentioned would permit if it is believed that some reader may want to use those values for checking or for further computations. It is most important to keep these rules in mind when we interpret statistics that we read, when they are intended to indicate population values.

Means of Future Samples.—Note that the interpretations of σ_M given
above say nothing about the means of future samples drawn from the 
same population and how far they may deviate from the obtained mean. 
For all we know, the one we obtained may be the highest or the lowest 
within the total range of obtainable sample means. The dispersion of 
sample means is always around the population mean as the point of reference; 
never, except by rare chance, around the obtained mean. We cannot 
make any very accurate prediction about where future sample means will 
fall, therefore, though, of course, they will be expected somewhere near 
the obtained mean. If we knew the value of the population mean we 
could certainly make such a prediction and the standard error would 
inform us of the probable size of error of our prediction.

When Sampling Is Not Random.—It has been repeatedly stressed that 
sampling statistics, including standard errors, apply only when sampling 
has been random. The reason for this is that the mathematics of the 
situation are exact only when sampling has been random. Any condi- 
tion that tends to interfere with randomness of selection of observations, 
therefore, will make the estimation of standard errors and their appli- 
cation in drawing conclusions inaccurate, if not misleading. There are 
several noteworthy situations that depart from the random requirement. 
Some would lead to standard errors that are too small to describe the 
actual distributions of means, and others would lead to standard errors 
that are too large. In the former case, we would have too much confi-
dence in the accuracy of the mean, and in the latter case we would have
too little. There have been developed certain variations in the standard- 
error formulas to take care of some of the special situations. 

Samples with Bias.—The effect of biased sampling upon the distribu-
tion of means can be strikingly illustrated by reference to some data on
the training of pilots in the AAF during World War II. All pilot stu- 
dents were given a battery of classification tests from which was derived 
for each man a “pilot stanine” or composite pilot-aptitude score. Every 
month at the completion of preflight training, students were formed into 
class groups, each sent to a different primary flying school. In one study 
which covered a six-month period, 269 such classes had been sent to 
58 training schools divided among three AAF Flying Training Commands.
The mean stanine for approximately 52,000 students was 5.56. This 
value may be taken as the population mean in this situation. The 
standard deviation of the population was assumed to be 1.96. The
average size of sample (each class group in a single school) was 195.¹
From this information, using formula (9.1), we compute a standard error
of 0.14. From this we would expect two-thirds of the 269 mean stanines
to deviate not more than 0.14 from 5.56, if the sampling had been random.
What are the facts?

When the 269 means were actually compiled in a frequency distribu- 
tion and their standard deviation computed, the dispersion of means was 
actually found to be very much larger than was expected (see Table 9.2). 


TABLE 9.2.—SAMPLING STATISTICS CONCERNING 269 CLASS GROUPS OF PILOTS IN
PRIMARY TRAINING DURING A PERIOD OF SIX MONTHS IN THREE TRAINING
COMMANDS OF THE AAF DURING WORLD WAR II*

                        Expected results        Obtained results
Variable                σ       Range           M       σ       Range
Pilot stanine           0.14    5.2-6.0         5.56    0.37    4.6-6.9
Graduation rate         3.4     56-75           65.3    9.5     40-90
Validity coefficient    .073    0.32-0.74       0.53    .088    0.21-0.71

* Including the pilot stanine, or composite pilot-aptitude score; the graduation rate, or percentage
of a class graduating; and the validity coefficient, a biserial coefficient of correlation between stanine and
graduation versus elimination.


Where one would expect a range of means within the limits 5.2 to 6.0, 
the actual range was from 4.6 to 6.9. Where the expected standard devi- 
ation of the distribution of means was 0.14, the actual standard deviation 
was 0.37. A comparison of the expected and obtained distribution of 
means is shown in Fig. 9.3. 
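
The comparison of expected and obtained dispersion can be retraced in a few lines. The sketch assumes that formula (9.1), which is not reproduced in this section, is the population standard deviation divided by the square root of N; the variable names are our own.

```python
import math

population_sd = 1.96          # assumed population standard deviation of stanines
class_size = 195              # average size of a class group
obtained_sd_of_means = 0.37   # standard deviation actually found among the 269 means

# Assumed form of formula (9.1): population sigma over the square root of N
expected_se = population_sd / math.sqrt(class_size)
ratio = obtained_sd_of_means / expected_se

print(f"expected standard error = {expected_se:.2f}")
print(f"obtained / expected     = {ratio:.1f}")
print(f"ratio of variances      = {ratio ** 2:.1f}")
```

A dispersion more than two and a half times the expected one is far beyond what random sampling of class groups would produce.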

¹ Actually, some classes deviated from 195 in number. For the sake of an illustration, however, we may treat the samples as if they were of constant size.

The obvious conclusion is that the sampling of aviation students in pilot classes was most probably not random. One can surmise some of the causes after looking into the procedures by which class groups were made up. In each preflight class (i.e., each month) a small percentage of students would fail to pass the curriculum successfully and would be held over, probably to qualify for flight training in the next class. There was a tendency for the “holdovers” to be sent together to the same flight schools. They tended to be of low pilot aptitude. There may have been some geographical differences in pilot aptitude which would tend to make the averages of stanines differ systematically somewhat from one Command to another. This hypothesis could be subjected to experimental check by comparing Command averages. There were probably other reasons for students of similar aptitudes to gravitate together, hence the biasing of samples.

Another study was made of the graduation rates (percentage of a class 
group graduating) in different samples. The pertinent data are given in 
Table 9.2. From the over-all graduation rate of 65.3 and the size of 
sample, we would expect (by formula 9.18) a standard deviation of the 
distribution of the 269 rates to be 3.4. Actually it was 9.5. Since the 
probability of graduation for any cadet was strongly correlated with his 
aptitude score, we would expect the bias in sampling on aptitude to be 
reflected in biased samples as to graduation rate. This is probably not 
the whole story, however. There were many other conditions which 
could contribute to marked variations in graduation rate besides the 
variations in aptitude. Weather conditions varied from school to school 
and from month to month. Training practices and policies may have 
varied, in spite of close regulation. Instructor and test-pilot judgments 
were not standardized hurdles and may have varied from school to school. 
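
The expected figure of 3.4 can be reproduced under one assumption: that formula (9.18), which lies outside this section, is the usual standard error of a percentage, the square root of PQ/N with P and Q in percentage units. The sketch below rests on that assumption.

```python
import math

graduation_rate = 65.3   # over-all percentage graduating (P)
class_size = 195         # average size of a class group (N)
obtained_sd = 9.5        # standard deviation found among the 269 rates

# Assumed form of formula (9.18): sqrt(P * Q / N), with Q = 100 - P
expected_sd = math.sqrt(graduation_rate * (100 - graduation_rate) / class_size)

print(f"expected = {expected_sd:.1f}, obtained = {obtained_sd}, "
      f"ratio = {obtained_sd / expected_sd:.1f}")
```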

A third study is mentioned now for comparison, although it involves 
the sampling errors of coefficients of correlation which are treated later. 
This study is concerned with the variation in validity coefficients in the 
same 269 class groups. The validity of the pilot stanine for predicting 
the training success of pilots was indicated by what is known as the 
biserial coefficient of correlation (see Ch. 13). This has the same value 
as a Pearson product-moment r, but is computed when one of the vari-
ables, assumed actually to be normally distributed, is forced into two
categories. The two categories for the training criterion were the gradu- 
ates and the eliminees. The standard error for a biserial correlation equal 
to .53 when the size of sample is 195 amounts to .073 (computed by 
formula 13.8). The expected and obtained statistics are given in Table 9.2 
and illustrated in Fig. 9.3. In drawing the distribution curve, normal dis- 
tribution of the coefficients was assumed, whereas the expected distribu- 
tion should be slightly negatively skewed. The obtained distribution of 
the 269 coefficients was actually so skewed. At any rate, since the 
obtained standard deviation was only .088 and not so very different from 
the expected one (.073), we may conclude that if there was biased sam- 
pling with respect to the validity of pilot stanines it was of minor impor- 
tance. This is reassuring for the stability of useful selection by means of 
the aptitude score. While there were seemingly enormous variations in 
validity from school to school and from time to time, amounting to a spread from .21 to .71, those variations may be regarded as due mostly to sampling errors. Incidentally, this example shows just how much obtained correlation coefficients may deviate from the population parameter even with samples as large as 195. Any single obtained coefficient may be anywhere in the range of such a distribution, but the saving feature is that extreme deviations are highly improbable and small ones most probable. These illustrations should demonstrate more clearly some of the practical uses of standard errors, as well as the importance of random sampling if we are going to make accurate and useful interpretations.

[Figure: two panels of frequency curves, each comparing an expected distribution with an obtained distribution. Upper panel: pilot stanine (aptitude score) scale. Lower panel: validity (correlation) coefficient scale, with the obtained distribution marked σ = 0.088.]

Fig. 9.3.—Distribution of expected and obtained sample means, also of expected and obtained validity coefficients, in connection with 269 samples (class groups) of AAF pilots in primary training during a five-month period in about 60 different schools. Especially to be noted is that the obtained distribution of means was much wider than expected, indicating nonrandom sampling, while the distribution of validity coefficients was about as expected, indicating random sampling. This is possible because two different kinds of sampling are involved.

When Observations Are Not Independent.—Random sampling also implies independence of observations. In the preceding examples, observations were not independent because certain restricting conditions tied cases together; if one student was chosen to go to a certain school at a certain time, one or more others like him were also chosen with him. There are other situations where this occurs, many times without the investigator's being aware of it. It is most likely to occur when sampling is obtained from subgroups of the population.

Suppose we have an experiment in which there are 10 subjects and 
each has 10 trials in each experimental session. For each session we do 
not have 100 independent observations. Nor do we have merely 10 obser- 
vations. Because there are individual differences, the 10 observations in 
each set will be somewhat homogeneous, having been derived from a 
single source. In the larger setting of the 100 observations, they are not 
independent. In computing σ_M for the mean of these 100 observations,
the number of degrees of freedom is not 99. It is difficult to say just 
what it should be. The most conservative approach would be to assume 
10 observations, each being the mean derived from one individual, and 
9 degrees of freedom. But this would lead to an overestimate of the 
standard error. In the situation described, we have what is called cluster 
sampling. For a special treatment of this subject which includes formulas 
for estimating σ_M, the reader is referred to a discussion by Marks.¹
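
A small numerical sketch may show the two extremes between which the proper standard error lies. The scores below are hypothetical, built so that the 10 subjects differ widely while each subject's 10 trials differ little; the naive figure treats all 100 scores as independent, and the conservative figure uses one mean per subject with 9 degrees of freedom.

```python
import math

def sd(values):
    """Standard deviation with divisor N."""
    m = sum(values) / len(values)
    return math.sqrt(sum((v - m) ** 2 for v in values) / len(values))

# Hypothetical data: 10 subjects at different levels, 10 closely similar trials each
subject_levels = [10 + 2 * i for i in range(10)]
trial_offsets = [0.2 * (j - 4.5) for j in range(10)]
scores = [[level + offset for offset in trial_offsets] for level in subject_levels]

all_scores = [x for subject in scores for x in subject]
subject_means = [sum(subject) / len(subject) for subject in scores]

# Treats the 100 scores as independent observations (99 degrees of freedom)
naive_se = sd(all_scores) / math.sqrt(len(all_scores) - 1)
# Treats each subject's mean as a single observation (9 degrees of freedom)
conservative_se = sd(subject_means) / math.sqrt(len(subject_means) - 1)

print(f"naive SE = {naive_se:.2f}, conservative SE = {conservative_se:.2f}")
```

The more homogeneous the trials within a subject, the wider the gap between the two figures becomes.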

When Populations Have Been Stratified.—Another instance of sampling
from subgroups of a population is that of stratified-random sampling. 
The effects of grouping in this instance are in the opposite direction of 
those mentioned above. Stratifying tends to stabilize the dispersion of 
sample means and of other statistics, preventing their scattering as much 
as would be true of a completely random sample. Consequently, the σ_M
derived in the ordinary manner is an overestimate. Such a standard error 
is therefore a conservative estimate of statistic fluctuations. 

¹ Marks, E. S. Sampling in the revision of the Stanford-Binet Scale. Psychol. Bull., 1947, 44, 413-434.

Certain corrective procedures have been developed for the case of stratified-random sampling. The most general and serviceable formula is

$$\sigma_M = \sqrt{\frac{\sigma^2 - \sigma_m^2}{N}} \tag{9.8}$$
(Standard error of mean corrected for stratification in sampling)

where σ² = variance in the total sample.
σ²_m = variance among the means of the subgroups.

Each subgroup is a sample representing a stratum, within which there has been random sampling. It should be pointed out that the variance σ²_m is a weighted affair; that is, the contribution of each set of data to the variance is in proportion to its size. The formula for this is

$$\sigma_m^2 = \frac{1}{N}\left[N_1(M_1 - M)^2 + N_2(M_2 - M)^2 + \cdots + N_k(M_k - M)^2\right] \tag{9.9}$$
(Weighted variance of means of sample sets)

where N₁, N₂, . . . , N_k = numbers of cases in sets 1 to k, respectively.
N and M refer to the total, composite sample.

For further discussion of this topic, and additional formulas, the reader is referred to an excellent treatment of the general subject of sampling by McNemar.¹
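
The correction for stratification may be sketched as follows, assuming that formula (9.8) is the square root of (σ² − σ²_m)/N and that formula (9.9) weights each subgroup's squared deviation from the composite mean by its number of cases and divides by N. The three strata are hypothetical, and the function name is our own.

```python
import math

def stratified_standard_error(strata):
    """Standard error of the mean corrected for stratification, following the
    assumed forms of formulas (9.8) and (9.9). `strata` is a list of score lists."""
    all_scores = [x for stratum in strata for x in stratum]
    n_total = len(all_scores)
    grand_mean = sum(all_scores) / n_total
    total_variance = sum((x - grand_mean) ** 2 for x in all_scores) / n_total
    # (9.9): variance among subgroup means, each weighted by its number of cases
    between_variance = sum(
        len(s) * (sum(s) / len(s) - grand_mean) ** 2 for s in strata
    ) / n_total
    # (9.8): remove the between-strata variance before dividing by N
    return math.sqrt((total_variance - between_variance) / n_total)

strata = [[4, 5, 6], [9, 10, 11, 10], [14, 15, 16]]   # hypothetical strata
all_scores = [x for s in strata for x in s]
mean_all = sum(all_scores) / len(all_scores)
sigma = math.sqrt(sum((x - mean_all) ** 2 for x in all_scores) / len(all_scores))
ordinary_se = sigma / math.sqrt(len(all_scores) - 1)   # formula (9.2), uncorrected

print(f"ordinary SE = {ordinary_se:.2f}, "
      f"corrected SE = {stratified_standard_error(strata):.2f}")
```

When the strata differ greatly in their means, as in these made-up data, the ordinary standard error is a considerable overestimate.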

The Size of Population.—In previous paragraphs we have assumed that populations are of infinite size, or at least that they are extremely large as compared with the size of sample extracted. In some situations the total population may be finite, and not many times larger than the sample. This restriction means that successive samples would have in them many more cases in common, and this leads to greater similarity in means. If the size of the population is known, we can take it into account in estimating σ̄_M and hence obtain a more realistic figure for it. A serviceable formula is

$$\sigma_M = \frac{\sigma}{\sqrt{N - 1}} \sqrt{1 - \frac{N}{N_p}} \tag{9.10}$$
(Standard error of mean corrected for size of population)

in which N_p is the number in the total population and other symbols are as previously defined. It can be seen that as N_p becomes very large compared to N the correction factor under the radical at the right approaches 1.0 and the standard error of a mean is then estimated by the customary formula. When the sample contains 1 per cent of the population, the value of the factor at the right reduces to .995 and the standard error is only one-half of one per cent lower than it would be without the correction.
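
The size of the correction is easily tabulated. The sketch below assumes the factor in formula (9.10) to be the square root of (1 − N/N_p), which agrees with the value of .995 quoted for a sample that is 1 per cent of the population; the function names are our own.

```python
import math

def finite_population_factor(n_sample, n_population):
    """Correction factor sqrt(1 - N / Np) applied to the customary standard error."""
    return math.sqrt(1 - n_sample / n_population)

def corrected_standard_error(sigma, n_sample, n_population):
    """Customary standard error, formula (9.2), times the correction factor."""
    customary = sigma / math.sqrt(n_sample - 1)
    return customary * finite_population_factor(n_sample, n_population)

for percent in (1, 10, 25, 50):
    factor = finite_population_factor(percent, 100)
    print(f"sample is {percent:>2} per cent of population: factor = {factor:.3f}")
```

Only when the sample is a sizable fraction of the population does the correction become worth making.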

¹ McNemar, Q. Sampling in psychological research. Psychol. Bull., 1940, 37, 331-365.

Matching of Samples.—In some investigations, there is restriction in sampling brought about by matching. Experimental and control groups are often equated in some respects while studying the effect of some varied condition upon a measured outcome. Groups are frequently “equated” for such matching variables as chronological age, mental age, IQ, socio-economic status, or for initial score on some particular task or test. As in the case of stratified sampling, it pays to match samples only on variables that are correlated with the measured variable, that is, the variable on which we note the experimental outcome. The matching may be by pairs (e.g., for every individual of a certain kind in the experimental group there is a similar one in the control group) or by total group (assuring that the means, standard deviations, and skewness are practically the same for the matching variable in the two groups). It is logical that if we try to keep successive samples constant with respect to the mean on some variable positively correlated with the experimental variable, the means on the latter will also be kept more constant, depending upon the extent of that correlation. The standard error of a mean should then be smaller under this restriction. The general formula is


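$$\sigma_M = \frac{\sigma}{\sqrt{N - 1}} \sqrt{1 - r_{mx}^2} \tag{9.11a}$$

$$\sigma_M = \sigma \sqrt{\frac{1 - r_{mx}^2}{N - 1}} \tag{9.11b}$$
(Standard error of a mean corrected for effects of matching samples)

where r_mx = correlation between the matching variable and the experimental variable.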

Inspection of formula (9.11a) will show that the first factor, σ/√(N − 1),
is the customary standard error. What the second factor, √(1 − r²_mx),
does is to modify, by lowering, the size of the standard error. The larger
r becomes, the greater is the correction effect. The correlation has to be
as high as .866 in order to make the correction as much as .50, in which
case the standard error is half as large as it would be without matching.
The same change in σ_M could be accomplished by increasing the size of
sample four times with random sampling. When r_mx is .707, the reduc-
tion is equivalent to that obtainable by doubling the size of sample in
random sampling. This gives some idea of the economy of measurement 
to be achieved by matching samples. 
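
The figures quoted for correlations of .866 and .707 can be verified with the short sketch below. It takes the second factor of formula (9.11a) to be the square root of (1 − r²); the "equivalent sample multiplier" is our own name for the approximate factor by which a random sample would have to be enlarged to achieve the same reduction.

```python
import math

def matching_factor(r_mx):
    """Second factor of formula (9.11a): sqrt(1 - r squared)."""
    return math.sqrt(1 - r_mx ** 2)

def equivalent_sample_multiplier(r_mx):
    """Approximate enlargement of a random sample giving the same standard error.
    The standard error shrinks as the square root of the sample size, hence 1 / (1 - r squared)."""
    return 1 / (1 - r_mx ** 2)

for r in (0.30, 0.50, 0.707, 0.866):
    print(f"r = {r:.3f}: factor = {matching_factor(r):.3f}, "
          f"equivalent to about {equivalent_sample_multiplier(r):.2f} times the cases")
```

Modest correlations evidently buy little: at r = .30 the standard error is lowered by less than 5 per cent.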

If the matching has been done on the basis of more than one variable, 
the correlation called for in formula (9.11) is the multiple correlation (see 
Ch. 16) between a combination of the matching variables and the experi- 
mental variable. Matching on the basis of many variables does not 
ordinarily pay unless the matching variables are themselves relatively 
independent, i.e., uncorrelated with each other. Adding one more matching
variable to several may not increase the multiple correlation and
hence not lower the standard error. 

Sometimes a sample group is matched on the same variable, as when we 
give it a pre-test and a post-test, with intervening experience or practice. 
In this case, the paired cases are identical individuals. The variability of 
means to be expected from successive sampling of this kind is indicated 
by the following estimate of the standard error: 


$$\sigma_M = \frac{\sigma \sqrt{1 - r_{xx}}}{\sqrt{N - 1}}$$
(Standard error of a mean for matching a group on the experimental variable) (9.12)

in which $r_{xx}$ = the test-retest reliability (see Ch. 17) of the experimental
variable. The reader who is familiar with the reliability statistics
described in Ch. 17 will recognize the product $\sigma \sqrt{1 - r_{xx}}$ as the standard
error of measurement of individuals. Dividing by degrees of freedom
should indicate similarly the dispersion of means of measurement.1

THE RELIABILITY OF OTHER STATISTICS

The Reliability of a Median.—The variability of sample medians is
about 25 per cent greater than the variability of means when the population
is normally distributed. Under this condition the standard error
of a median can be estimated by the formula

$$\sigma_{Mdn} = \frac{1.253\sigma}{\sqrt{N}}$$
(Standard error of a median estimated from $\sigma$) (9.13)

in which $\sigma_{Mdn}$ stands for the standard error of a median. As applied to
the ink-blot test data,

$$\sigma_{Mdn} = \frac{(1.253)(10.45)}{\sqrt{50}} = \frac{13.09385}{7.071} = 1.85$$

Two-thirds of the sample medians of ink-blot scores, when N equals 50,
in samples drawn at random from the population will be expected within
1.85 units of the population median. Since the population is normally
distributed, by assumption, we may also say that the sample medians
would not deviate from the population mean more than 1.85 units, two-thirds
of the time. The median may thus be used as an estimate of the
population mean, but with less confidence than we have in the use of the
sample mean for the same purpose.
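
The computation for the median can be reproduced in a few lines. This is a minimal sketch of formula (9.13) applied to the ink-blot values quoted above; the function name and the printed limits are illustrative assumptions, and the limits are meaningful only if the population is normally distributed, as the text requires.

```python
import math

def se_median(sigma, n):
    """Standard error of a median estimated from sigma, following formula (9.13)."""
    return 1.253 * sigma / math.sqrt(n)

se = se_median(10.45, 50)
print(round(se, 2))

# Distances from the population median expected to hold about two-thirds
# and about 95 per cent of sample medians, respectively.
print(round(1.0 * se, 2), round(1.96 * se, 2))
```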


The Standard Error of a Standard Deviation.—The standard deviation
will also fluctuate from sample to sample. For a given size of sample,
the sampling distribution of $\sigma$ is somewhat skewed for small samples but
approaches the normal form so closely for large samples that we can draw
inferences about a sample $\sigma$, knowing its standard error. This SE is
estimated by the formula

$$\sigma_\sigma = \frac{\sigma}{\sqrt{2N}}$$
(Standard error of a standard deviation) (9.14)

Applied to the ink-blot data,

$$\sigma_\sigma = \frac{10.45}{\sqrt{100}} = 1.045$$
1 For further discussion of standard errors in matched and other restricted samples
see Peters, C. C., and Van Voorhis, W. R. Statistical procedures and their mathematical
bases. New York: McGraw-Hill, 1940. Pp. 132-135.


We can now say that the odds are 2 to 1 that the sample $\sigma$ will not deviate
more than 1 unit (1.045 should be rounded to 1.0) from the population $\sigma$.
We can also say that the odds are about 19 to 1 that a sample $\sigma$ will not
deviate more than 2 units from the population $\sigma$.

Comparing formula (9.14) with formula (9.2) for the standard error of a
mean, we can see that a population standard deviation is more accurately
estimated than a population mean, when we compare them as to sampling
errors. The denominators of these two formulas contain the values
2N and (N − 1), respectively, which means that $\sigma_M$ is usually more
than 40 per cent greater than $\sigma_\sigma$. For the ink-blot data, the two standard
errors are 1.045 for $\sigma_\sigma$ and 1.49 for $\sigma_M$. In one sense it is fortunate that
the standard deviation is more stable than the mean, because both $\sigma_M$
and $\sigma_\sigma$ are estimated from it.
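
The "more than 40 per cent" comparison follows from the two denominators. The sketch below assumes formula (9.2) has the form $\sigma/\sqrt{N - 1}$, as the discussion of denominators implies; the function names are illustrative only.

```python
import math

def se_sd(sigma, n):
    """Standard error of a standard deviation, following formula (9.14)."""
    return sigma / math.sqrt(2 * n)

def se_mean(sigma, n):
    """Standard error of a mean, assuming formula (9.2) is sigma / sqrt(N - 1)."""
    return sigma / math.sqrt(n - 1)

sigma, n = 10.45, 50                      # ink-blot data
ratio = se_mean(sigma, n) / se_sd(sigma, n)
print(round(se_sd(sigma, n), 3), round(se_mean(sigma, n), 2), round(ratio, 3))

# The ratio equals sqrt(2N / (N - 1)), which exceeds sqrt(2) for any N > 1.
print(round(math.sqrt(2 * n / (n - 1)), 3))
```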

When there are departures from ordinary random sampling (from
stratified populations, finite populations, or from matched samples),
corrections in the estimate of $\sigma_\sigma$ are in order just as they are in connection
with $\sigma_M$. By analogy to formulas given above for $\sigma_M$, one can make
the appropriate corrections in $\sigma_\sigma$. In general, the occasion for computing
a $\sigma_\sigma$ is a rare event. The need for making a correction in one of these
special situations is even more rare. But where called for, as in the case
of $\sigma_M$, such a correction may make a real difference in conclusions drawn.


The Standard Error of Q.—The standard error of the semi-interquartile
range is estimated by the formula

$$\sigma_Q = \frac{.7867\sigma}{\sqrt{N}}$$
(Standard error of Q estimated from $\sigma$) (9.15)

when the population distribution is normal. Applied to the ink-blot data,

$$\sigma_Q = \frac{(.7867)(10.45)}{\sqrt{50}} = \frac{8.221}{7.071} = 1.16$$

If the standard deviation is not known, the next best procedure is to use
the formula

$$\sigma_Q = \frac{1.166Q}{\sqrt{N}}$$
(Standard error of Q estimated from Q) (9.16)

This substitute formula is possible because in a normal distribution
$Q = .6745\sigma$. Applied to the ink-blot data, this formula gives

$$\sigma_Q = \frac{(1.166)(7.5)}{\sqrt{50}} = 1.24$$


The slight discrepancy between $\sigma_Q$ as estimated by these two formulas
may be due to the fact that the sample distribution was not quite normal
and hence Q did not equal exactly $.6745\sigma$, or to minor irregularities in
frequencies in class intervals that were crucial for the estimation of Q.

The interpretation of $\sigma_Q$ is comparable to that of other standard errors
already encountered in this chapter, i.e., in terms of degree of confidence
that the sample Q could deviate certain distances from the central value
of the sampling distribution.
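
The two estimates of the standard error of Q, and the source of their discrepancy, can be laid side by side. This is a minimal sketch of formulas (9.15) and (9.16) using the ink-blot values quoted above (a standard deviation of 10.45, Q = 7.5, N = 50); the function names are illustrative.

```python
import math

def se_q_from_sigma(sigma, n):
    """Standard error of Q estimated from sigma, following formula (9.15)."""
    return 0.7867 * sigma / math.sqrt(n)

def se_q_from_q(q, n):
    """Standard error of Q estimated from Q itself, following formula (9.16)."""
    return 1.166 * q / math.sqrt(n)

sigma, q, n = 10.45, 7.5, 50
print(round(se_q_from_sigma(sigma, n), 2), round(se_q_from_q(q, n), 2))

# In an exactly normal distribution Q would equal .6745 * sigma; the observed Q
# is somewhat larger, which is why the second estimate is the larger of the two.
print(round(0.6745 * sigma, 2), q)
```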

The Reliability of a Proportion.—Data in terms of frequencies, per- 
centages, and proportions are so common in psychology and the social 
sciences that the problem of their reliability is very important. Each 
obtained proportion is a sample statistic and, as such, it may be expected 
to fluctuate from sample to sample. Out of 100 students quizzed at ran- 
dom, the proportion of them who reported the habit of reading a daily 
newspaper is .65. How well does this proportion represent the student 
population? Assuming that we have a random sample, there is a way of 
estimating how such a proportion of 100 observations might be expected 
to vary. The standard error of a proportion measures this variation, and 
with a known or assumed form of distribution of the sample proportions we 
can arrive at conclusions as to the accuracy of the obtained result. 

The standard error of a proportion is given by the formula

$$\sigma_p = \sqrt{\frac{\bar{p}\bar{q}}{N}}$$
(Computed standard error of a proportion) (9.17a)

where $\bar{p}$ = the proportion of the population who are in the category
selected.
$\bar{q}$ = the proportion of the population who are not in the category
($\bar{q} = 1 - \bar{p}$).
N = the number in the sample.

We ordinarily do not know the parameters $\bar{p}$ and $\bar{q}$. The practical
solution is to use the sample p and q as the best estimates we know for
those values. The useful formula is therefore

$$\sigma_p = \sqrt{\frac{pq}{N}}$$
(Estimated standard error of a proportion) (9.17b)

The total outcome of formula (9.17) depends relatively more upon
the size of N than upon p and q because the product pq remains fairly
constant between .20 and .25 for quite a range of values of p (namely,
between .27 and .73) and because in most cases the sample p will not
be very divergent from $\bar{p}$. As $\bar{p}$ goes outside the limits of .27 to .73 and
as it approaches 0.0 or 1.0, the divergence of p from $\bar{p}$ becomes smaller
and smaller. If one has better knowledge concerning the population
$\bar{p}$, which is provided by other information, for example a p from a larger
sample or from a series of prior samples, one could use some other estimate
of $\bar{p}$ as a hypothesis. One could arbitrarily choose some hypothetical
$\bar{p}$ derived upon the basis of a priori reasoning. This approach will
be given more attention in Ch. 11 on “Testing Hypotheses,” so will not
be discussed further here.

For the newspaper-reading data suggested above, where p is .65 and N
is 100, the standard error is therefore estimated by formula (9.17b):

$$\sigma_p = \sqrt{\frac{(.65)(.35)}{100}} = \sqrt{.002275} = .048$$


The interpretation of this result, as usual, depends upon an assumption
about the form of the sampling distribution. The sampling distribution
of p approaches the normal form if N is not too small, and if p is not
too close to .00 or to 1.00. It must be stated by way of qualification
that as $\bar{p}$ deviates from .50 in either direction the distribution of p becomes
skewed. This is because no p can fall below 0.0 or go above 1.0.
Distributions are curtailed at those extremes but can extend greater distances
in the opposite direction. As samples become very large, however,
dispersions become so narrow that these terminal restrictions have less
importance. As a practical rule for avoiding seriously nonnormal sampling
distributions of p, some statisticians recommend that we forgo
estimating $\sigma_p$, or at least interpreting it, when the product Np (or Nq,
whichever is smaller) is less than 10.1 Thus, if N is as small as 20, only
one proportion could qualify to meet this rule, namely p = .5. For small
samples greater than 20 there is less restriction, but some. For example,
if N = 40, only proportions between .25 and .75 could qualify for meeting
normal-distribution standards under this rule. There are other methods
of dealing with cases that do not come under this rule.2
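
Both the estimated standard error and the practical rule about Np and Nq are easy to check mechanically. The following is a minimal sketch under the assumptions stated in the text; the function names and the `minimum` parameter are hypothetical conveniences, with the default of 10 taken from the rule just described.

```python
import math

def se_proportion(p, n):
    """Estimated standard error of a proportion, following formula (9.17b)."""
    return math.sqrt(p * (1.0 - p) / n)

def normal_interpretation_allowed(p, n, minimum=10):
    """Apply the rule that the smaller of Np and Nq should not fall below 10."""
    return n * min(p, 1.0 - p) >= minimum

print(round(se_proportion(0.65, 100), 3))          # newspaper-reading data

print(normal_interpretation_allowed(0.65, 100))    # rule is met
print(normal_interpretation_allowed(0.50, 20))     # the only qualifying p when N = 20
print(normal_interpretation_allowed(0.40, 20))     # fails: Np is only 8
print(normal_interpretation_allowed(0.25, 40), normal_interpretation_allowed(0.20, 40))
```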

The obtained $\sigma_p$ in connection with the newspaper data is .048, or
approximately .05. Since the conditions for normal distribution of the
sample proportions are satisfied, we can say that the odds are 2 to 1
that the obtained proportion is not further than .05 from
the population proportion. Our margin of error in the proportion of .65
may be stated as .05. Enlarging our confidence limits, we may feel much
more certain (odds about 19 to 1) that this obtained proportion is not
more than .10 away from the population value.

1 Treloar, A. E. Elements of statistical reasoning. New York: Wiley, 1939. P. 180.
2 Ibid., Ch. 12.

The Proportion as a Mean.—In connection with the question of reliability
of a proportion it is interesting to know that in one important
sense the proportion is actually a mean and its standard error is actually 
the standard error of a mean. A numerical example will illustrate this 
point. 

Suppose we have administered a certain test item to 100 individuals, 
of whom 80 give the correct answer and 20 do not. Let each successful 
person receive a “score” of 1 and each unsuccessful person a “score” of 0. 
That is actually what we usually do in scoring a test composed of items. 
Each item may be regarded as a subtest on which the range of scores is 
usually 2 units. We need not confine this reasoning to responses to test 
items. Wherever events can be classified into a certain category or not, 
we can arbitrarily give a value of 1 to all cases in the category and a 
value of 0 to those not in the category. Other examples might be possess- 
ing a habit of reading a daily newspaper versus not having the habit; 
being an alcoholic versus not being an alcoholic; voting for candidate X 
versus not voting for candidate X; and so on. In terms of probability, 
the value of 1.0 stands for absolute certainty of an event’s occurring and 
zero stands for absolute certainty of its not occurring. A proportion can 
thus be regarded as an average probability. 

Returning to the test-item problem, the mean score for the 100 indi- 
viduals is the sum of the scores divided by the number of them, in other 
words, $\Sigma X/N$ or $\Sigma fX/N$. The sum of the scores is 80 and N is 100,
from which the mean is .8. This is also the proportion passing the item. 
Thus our proposition that the proportion is a mean is demonstrated. 

To find the standard error of a mean, as by formula (9.2), we need to
know the standard deviation of the sample. It can be shown that for a
distribution in two categories the variance is equal to the product pq
and the standard deviation is equal to $\sqrt{pq}$. This is demonstrated in
Table 9.3. This table shows both the numerical solution for this particular
illustrative problem and also the general solution in terms of
symbols. From the table it should be clear that the variance equals pq
and the standard deviation equals $\sqrt{pq}$. Using the latter as an unbiased
estimate of the population standard deviation, by substitution for $\sigma$ in
formula (9.2) we have $\sqrt{pq}/\sqrt{N}$, or $\sqrt{pq/N}$, which is formula (9.17b)
for the standard error of a proportion. Having used the obtained p as
an unbiased estimate of the population $\bar{p}$, since it is a mean, we need not
be concerned with loss of degrees of freedom here. Consequently the
denominator in formula (9.17) is $\sqrt{N}$ rather than $\sqrt{N - 1}$.


TABLE 9.3.—COMPUTATION OF THE MEAN AND STANDARD DEVIATION FOR A
DISTRIBUTION IN TWO CATEGORIES

Numerical example

  X                     f      fX       x        fx²
  1                     80     80      +0.2      3.20
  0                     20      0      −0.8     12.80
  Sums                 100     80       —       16.00
  Mean                        .80 (M)            .16 (σ²)
  Standard deviation                             .4

Solution with symbols

  X                     f       fX      x        fx²
  1                     Np      Np      q        Npq²
  0                     Nq      0       −p       Nqp²
  Sums                  N       Np      —        Npq
  Mean                          p (M)            pq (σ²)
  Standard deviation                             √(pq)

  In the Sums row, Np + Nq = N(p + q) = N, and Npq² + Nqp² = Npq(p + q) = Npq.
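
The numerical half of Table 9.3 can be verified by treating the item responses as ordinary scores of 1 and 0. This is a small sketch of that check; the variable names are illustrative, and the variance is computed with N in the denominator, as in the table.

```python
import math

# 80 correct answers scored 1 and 20 incorrect answers scored 0.
scores = [1] * 80 + [0] * 20
n = len(scores)

mean = sum(scores) / n                                # the proportion p
variance = sum((x - mean) ** 2 for x in scores) / n   # sum of fx^2 divided by N
sd = math.sqrt(variance)

p, q = mean, 1.0 - mean
print(mean, round(variance, 4), round(sd, 4))
print(round(p * q, 4), round(math.sqrt(p * q), 4))    # same values from pq

# Standard error of the mean of these scores, which is formula (9.17b).
print(round(sd / math.sqrt(n), 4), round(math.sqrt(p * q / n), 4))
```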


The Standard Error of a Percentage.—If we wish to work in terms of
percentages instead of proportions we may do so. Let the percentage be
denoted by P and let Q equal 100 − P. Remembering that a percentage
is 100 times its corresponding proportion, the standard error of a percentage
will be 100 times as large as that for the proportion. The formula
reads

$$\sigma_P = 100\sqrt{\frac{pq}{N}} = \sqrt{\frac{PQ}{N}}$$
(Standard error of a percentage) (9.18)


The Standard Error of a Frequency.—A frequency, or the number of 
cases in a certain category, is equal to N times p, the proportion; con- 
sequently the standard error of a frequency is N times that for a pro- 
portion, and we have the formula 


$$\sigma_f = N\sqrt{\frac{pq}{N}} = \sqrt{Npq}$$
(Standard error of a frequency) (9.19)


Out of 30 students who attempted a certain test item, 18 succeeded
and 12 failed. How much confidence can we have that the 18 successes
represent the actual success rate for the larger population these 30 students
represent? The standard error, assuming a population $\bar{p}$ equal to
.60, by formula (9.19) is equal to $\sqrt{30 \times .6 \times .4} = \sqrt{7.20} = 2.7$. This
obtained frequency may therefore be presumed not to deviate more than
2.7 from the average frequency to be expected if we had examined the
entire population in samples of 30, with a degree of confidence that can
be expressed as a 2 to 1 bet. With a degree of confidence expressed by a
19 to 1 bet, we could say that we do not expect that this obtained frequency
departs more than 5.4 from the average frequency we would get
from many such samples.
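
Because a percentage is 100 times a proportion and a frequency is N times a proportion, all three standard errors can be derived from one function. The sketch below applies that idea to the 30-student example; the function names are illustrative, and the sample proportion of .60 is used in place of the unknown population value, as in the text.

```python
import math

def se_proportion(p, n):
    """Estimated standard error of a proportion, following formula (9.17b)."""
    return math.sqrt(p * (1.0 - p) / n)

def se_percentage(p, n):
    """Formula (9.18): 100 times the standard error of the proportion."""
    return 100.0 * se_proportion(p, n)

def se_frequency(p, n):
    """Formula (9.19): N times the standard error of the proportion."""
    return n * se_proportion(p, n)

n, successes = 30, 18
p = successes / n

se_f = se_frequency(p, n)
print(round(se_f, 1), round(2 * se_f, 1))        # limits for 2-to-1 and 19-to-1 odds
print(round(math.sqrt(n * p * (1.0 - p)), 1))    # the equivalent sqrt(Npq) form
print(round(se_percentage(p, n), 1))             # same uncertainty in percentage points
```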

Standard Errors of Proportions When Sampling Is Not Completely 
Random.—When sampling has been stratified or clustered or when popu- 
lations are restricted in size, we need to make corrections analogous to 
those already proposed for standard errors of other means. This holds 
also for standard errors of percentages and of frequencies. Corrections 
for the latter can be obtained because of their relations to the former, 
as indicated above. 

When there has been stratification, the standard error of a proportion is
estimated from the formula

$$\sigma_p = \sqrt{\frac{pq}{N} - \frac{\sigma_m^2}{N}}$$
(Standard error of a proportion corrected for stratification) (9.20)

where p = proportion observed in the entire sample, all strata combined.
q = 1 − p.
N = number in the total sample.
$\sigma_m^2$ = weighted variance of the several strata proportions about the
total sample proportion, p.

The solution for $\sigma_m^2$ needed in formula (9.20) is given by the formula

$$\sigma_m^2 = \frac{1}{N}\left[N_1(p_1 - p)^2 + N_2(p_2 - p)^2 + \cdots + N_k(p_k - p)^2\right]$$
(Weighted variance of sets of sample proportions) (9.21)

where $N_1, N_2, \ldots, N_k$ = numbers of cases in the different strata,
respectively, there being k strata.
$p_1, p_2, \ldots, p_k$ = proportions observed in these various strata.
N and p are as defined above.

It can be seen that if the various strata proportions all equal p, or nearly
so, the variance $\sigma_m^2$ is zero, or nearly so, and the standard error $\sigma_p$ is the
same as in purely random sampling. If the variance between strata is
not zero, formula (9.20) will give a smaller $\sigma_p$ than would be obtained
from formula (9.17).
If the population is of finite size and not too many times as large as
the sample, the following formula will provide an improved estimate of $\sigma_p$:

$$\sigma_p = \sqrt{\frac{pq}{N}\left(1 - \frac{N}{N_p}\right)}$$
(Standard error of a proportion corrected for size of population) (9.22)

in which N is the size of sample and $N_p$ the size of the population. This
formula is clearly analogous to formula (9.10) for the similar correction
of $\sigma_M$.
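
The stratification and finite-population corrections can be tried on made-up figures. The following is only a sketch: the strata sizes, strata proportions, and population size are hypothetical values chosen for illustration, and the functions assume the readings of formulas (9.20) to (9.22) in which the between-strata variance is subtracted from pq and the random-sampling variance is multiplied by one minus the sampling fraction.

```python
import math

def se_prop_stratified(strata):
    """strata: list of (n_i, p_i) pairs, one per stratum."""
    n = sum(n_i for n_i, _ in strata)
    p = sum(n_i * p_i for n_i, p_i in strata) / n        # total-sample proportion
    # Weighted variance of strata proportions about p, as in formula (9.21).
    var_between = sum(n_i * (p_i - p) ** 2 for n_i, p_i in strata) / n
    return math.sqrt((p * (1.0 - p) - var_between) / n)  # formula (9.20)

def se_prop_finite(p, n, n_pop):
    """Correction for a finite population of size n_pop, as in formula (9.22)."""
    return math.sqrt(p * (1.0 - p) / n * (1.0 - n / n_pop))

random_se = math.sqrt(0.65 * 0.35 / 100)                 # uncorrected, formula (9.17b)

# Hypothetical sample of 100 in two strata whose combined proportion is .65.
print(round(random_se, 4), round(se_prop_stratified([(40, 0.50), (60, 0.75)]), 4))

# The same sample drawn from a hypothetical population of only 400 cases.
print(round(se_prop_finite(0.65, 100, 400), 4))
```

When every stratum has the same proportion the stratified value falls back to the random-sampling value, and when the population is very large the finite-population factor approaches 1.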

When samples have been matched on the basis of some outside variable
correlated with the categorical variable on which the proportion is based,
by analogy to formula (9.11),

$$\sigma_p = \sqrt{\frac{pq}{N}}\sqrt{1 - r_{mx}^2}$$
(Standard error of a proportion corrected for effects of matching) (9.23)

in which $r_{mx}$ = correlation between the matching variable $X_m$ and the
experimental variable.
The correlation would be a point-biserial or a phi coefficient (see Ch. 13).
Reliability of a Coefficient of Correlation.—Like every statistic, the
coefficient of correlation is subject to errors of sampling. Let us say
that in a certain population the parameter correlation, $\bar{r}$, is equal to .30.1
From this population we take successive samples of 50 pairs of observations
each. The sample r's will fluctuate in a sampling distribution
around the population value, both above it and below it. An example
of this has already been reported in Table 9.2, where $\bar{r}$ was .53. How
much variability may we expect? We need a standard error of r and
some knowledge of the form of sampling distribution in order to say.
Sampling Distributions of r.—The sampling distribution of correlation
coefficients is not of a uniform shape. It depends both upon the size of r
and the size of sample. It is already known to the reader that the limits
of r are −1.0 and +1.0. An obtained coefficient cannot exceed those
limits. Consequently, as the population r approaches those limits, the
sampling distribution becomes more and more skewed; negatively skewed
for positive r's and positively skewed for negative r's. Only when the
population $\bar{r}$ is approximately zero is the sampling distribution expected
to be symmetrical (see Fig. 9.4). For large samples, however, one need
not worry very much about skewness in practice when $\bar{r}$ is within the
limits of −.80 to +.80. The larger the sample, the narrower the dispersion
of r's, and consequently the less restricting effect provided by the
limits of −1.0 and +1.0. It is conceivable that with enormously large
samples, with standard deviations of .01 or .02, even when $\bar{r}$ is .90, the
sampling distribution could be regarded as symmetrical with negligible
discrepancies. On the other hand, even when $\bar{r}$ is zero, if the sample is
very small (under 25) it is not safe to base interpretations upon the
assumption that the sampling distribution is normal, for reasons which
will be left to the discussion of small-sample statistics.

1 Some authors use the symbol $\rho$ (rho) to stand for a population correlation. Since
this is also the symbol used for a rank-difference correlation it seems unwise to use it
here. The use of $\bar{r}$, while not at all common, is consistent with analogous symbols,
$\bar{M}$ and $\bar{\sigma}$, for example.

An Estimate of $\sigma_r$.—We can estimate the standard error of r by the
general formula

$$\sigma_r = \frac{1 - r^2}{\sqrt{N - 1}}$$
(Standard error of a product-moment coefficient of correlation) (9.24)


This formula would be more accurate if we wrote $\bar{r}$ instead of r. There
is little risk in using r as an estimate of the population parameter that is
really needed if samples are large and if $\bar{r}$ is large. Examination of the
formula will show that for the same size of sample, $\sigma_r$ is largest when
r = .00 and becomes smaller as r approaches −1.0 or +1.0. The size of
the standard error itself indicates to some extent the amount of risk we
take in letting r stand for the population value.

[Figure 9.4: sampling distributions plotted on a scale of r (−1.00 to +1.00) and on a scale of z (−3.0 to +3.0).]

Fig. 9.4.—Distributions of sample coefficients of correlation when N is very small and
when the population correlations are .00 and .80. Corresponding to them are distributions
of Fisher’s z coefficients. Conversion of r to z brings about symmetrical sampling
distributions, regardless of the size of r.
To illustrate formula (9.24) with the case of a population $\bar{r}$ first, let us
take the values mentioned above, with $\bar{r}$ = .30, and N = 50. We have

$$\sigma_r = \frac{1 - .30^2}{\sqrt{50 - 1}} = \frac{.91}{7} = .13$$

Interpreted, this means that with a population $\bar{r}$ equal to .30, we may
expect two-thirds of sample r's, when N = 50, to lie within .13 of the
parameter $\bar{r}$, in other words, between .17 and .43. We also might expect
95 per cent of the sample r's under these conditions to be between .04
and .56, these values being $2\sigma_r$ distances from .30. There would be only
one chance in 100 that sample r's could deviate as much as .335 (this
being equal to $2.58\sigma_r$) from the population value. This much deviation
marks off the range from −.035 to .635. We should not be too sure of
these interpretations involving the extreme tails of the distribution, since
departures of the sampling distribution from normal form would show up
most at those places. But it can be seen how even negative coefficients
might arise by random sampling occasionally, even when the population
correlation is as large as .30. The smaller the $\bar{r}$ and the smaller the
sample, the more likely are these reversals of algebraic sign of correlation
to occur.
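
The limits quoted in this illustration follow directly from formula (9.24) and the normal curve. The sketch below reproduces them; the function names are illustrative, and the limits inherit the normality assumption that the text itself qualifies for the extreme tails.

```python
import math

def se_r(r, n):
    """Standard error of a product-moment r, following formula (9.24)."""
    return (1.0 - r ** 2) / math.sqrt(n - 1)

def normal_limits(center, se, z):
    """Limits lying z standard errors on either side of the center."""
    return center - z * se, center + z * se

pop_r, n = 0.30, 50
se = se_r(pop_r, n)
print(round(se, 2))

# About two-thirds, 95 per cent, and 99 per cent of sample r's, respectively.
for z in (1.0, 2.0, 2.58):
    low, high = normal_limits(pop_r, se, z)
    print(f"{z:>4} standard errors: {low:+.3f} to {high:+.3f}")
```

The last pair of limits dips below zero, which is the reversal of sign discussed above.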

Consider next the case when we must substitute an obtained r for the
parameter $\bar{r}$ in the use of formula (9.24). Let us use the obtained correlation
of +.61 from the problem in Table 8.5.

$$\sigma_r = \frac{1 - .61^2}{\sqrt{N - 1}} = \frac{1 - .3721}{\sqrt{86}} = \frac{.6279}{9.2736} = .068$$

It is sufficient to report $\sigma_r$, as for most standard errors, to two significant
digits. From the result we may say that whatever the population $\bar{r}$ may
be (and it is probably not far from .61), an obtained r such as .61 would
not deviate from it by more than .068 with a confidence indicated by
odds of 2 to 1. There are less than 5 chances in 100 that in samples of
this size the sample r would depart more than .136 from the population
value, and less than 1 chance in 100 that the sample r would depart more
than .175, above or below it. The obtained r, consequently, seems
securely placed in a region that is removed from zero or negative
correlations.

The Significance of Small r’s.—When r’s are small, i.e., in the region of
zero but either positive or negative, our interest should usually center on
the question as to whether such values could have arisen when the population
correlation is actually zero. In the previous illustrations we were
more concerned with the accuracy of determination of the amount of
correlation. Incidental to that problem we saw that some sampling distributions
could come close to zero if not extend beyond it. This becomes
a very serious problem when coefficients are numerically small and samples
are not large enough to fix the boundaries of sampling fluctuation
definitely clear of zero.

The best approach to the small r is to assume that the population
correlation is actually zero and then ask whether, with the size of sample
being what it is, the obtained r could have occurred merely by random
sampling. Our being able to conclude whether the obtained r represents
any genuine correlation at all depends upon this kind of test. Incidentally,
assuming that the population r is zero is one form, or one application,
of the null hypothesis of which we will hear much more later on.
Our working hypothesis is that there is a null amount of correlation.
Since formula (9.24) implies the use of the population $\bar{r}$, we may insert
any value for it that we please (except ±1.00, which would shrink $\sigma_r$ to
zero). Any $\bar{r}$ we chose to insert would be our hypothesis about the amount
of correlation. We could then compute $\sigma_r$ and test the hypothesis by
seeing whether the obtained r deviates too far from $\bar{r}$ to be reasonable.
A deviation that goes outside the practical limits of the normal distribution
would of course be very unreasonable. A deviation that is so large
as to occur by chance only a very small proportion of the time would also
be seriously questioned.

When the population $\bar{r}$ is zero, the standard error is estimated by the
formula

$$\sigma_{r_0} = \frac{1}{\sqrt{N - 1}}$$
(Standard error of r when the population $\bar{r}$ is assumed to be zero) (9.25)

This formula will apply satisfactorily when N is not less than 25, and
certainly when it is not less than 50. Applying this formula to the data
of Table 8.5,

$$\sigma_{r_0} = \frac{1}{\sqrt{86}} = \frac{1}{9.2736} = .108$$

The obtained correlation, .61, is more than 5 times as large as this standard
error. So very rarely could this much correlation occur by random
sampling in a population where X and Y are actually uncorrelated, that
we can reject the null hypothesis and say that almost certainly there is
positive correlation. We would not ordinarily make this test of a coefficient
as large as .61 unless the sample were quite small. Even if the
sample were 26, in which case $\sigma_{r_0}$ = .20, this obtained correlation would
be at least 3 times the standard error.
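
The test against a population correlation of zero amounts to one division. The sketch below follows formula (9.25); the function names are illustrative, and the sample size of 87 for Table 8.5 is an inference from the $\sqrt{86}$ in the computation above rather than a figure stated in this section.

```python
import math

def se_r_null(n):
    """Standard error of r when the population correlation is assumed zero (9.25)."""
    return 1.0 / math.sqrt(n - 1)

def ratio_to_null_se(r, n):
    """How many null-hypothesis standard errors the obtained r lies from zero."""
    return r / se_r_null(n)

# Obtained r of .61; N = 87 is assumed (inferred from sqrt(86) in the text).
print(round(se_r_null(87), 3), round(ratio_to_null_se(0.61, 87), 2))

# The same r in a sample of only 26 cases.
print(round(se_r_null(26), 2), round(ratio_to_null_se(0.61, 26), 2))
```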




The t-ratio Test of r.—The test of the null hypothesis, as was just
illustrated, lies in the examination of the ratio of an obtained r to $\sigma_r$.
In a normal distribution this ratio is, of course, a standard measure, z.
When we are dealing with sampling distributions, however, the custom is to
give this ratio a new symbol, t, and to speak of the ratio as a t ratio. In
general, t is defined as the ratio of a deviation to a standard error. In this
case we are dealing with the deviation of an obtained r from an (assumed)
population $\bar{r}$. The population $\bar{r}$ is assumed to be the mean of a sampling
distribution, thus the t ratio is interpretable as a standard measure in
relation to the normal distribution, when samples are large. When samples
are small, as we shall find later, we have other distributions to take
its place.

How large a correlation, or how large a t, is needed in order to lead
us to reject the null hypothesis? There is no single standard for rejection.
The reason is that values of t are on a continuous scale (see Fig. 9.5)
and all we can do is to note the probability of so large a t occurring by
chance. The smaller that probability, the more inclined we are to doubt
the null hypothesis, if not to reject it. If we reject the null hypothesis,
the chief alternative we have is to believe in a population correlation different
from zero. This is one of the chief virtues of a statistical test of
significance. The situation is reduced to two alternatives: either the
null hypothesis, in this case, or some other. There is either some correlation
(r not zero) or there is not. If we reject the null hypothesis with
considerable confidence, we have strong reason to accept the other.
Beyond this, as we shall see, there is weakness in this statistical test;
if we do not feel justified in rejecting the null hypothesis we are not
thereby forced to accept it. We can find evidence from the t test favoring
the rejection of a null hypothesis but if we cannot reject it the outcome
is inconclusive. These points will become clearer in subsequent
discussion. We will proceed one step at a time.

Confidence Levels.—The larger the t, the less likely it is that it could
occur by random sampling. There is general agreement that when t is
as large as 1.96 (in normal sampling distributions) we may regard t, and
the deviation for which it stands, as “significant.” In a normal distribution,
a t that deviates more than 1.96 (in either direction) from the mean
would occur only 5 times in 100.* This criterion is often referred to as
“significant at the .05 level of confidence,” or as “significant at the 5 per
cent level.” We could reject the null hypothesis with confidence that
only 5 times in 100 would we be wrong in so doing. We would be wrong
if we rejected the hypothesis when actually it happened to be true, i.e.,
there was actually a population correlation of zero. A more confident
criterion of rejection requires a t as large as 2.58, at which value there is
less than 1 chance in 100 that a t as large or larger could have occurred
by chance. With such a t obtained from a sample, we could reject the
null hypothesis with the confidence of being wrong only once in 100 times.
This is confidence at the .01 or 1 per cent level. Other levels are often
mentioned by some investigators who regard the 5 per cent and 1 per cent
levels insufficiently refined as criteria. The other levels, along with these
two, are summarized in Table 9.4.1

* The 5 per cent is equally divided between the two tails of the curve, i.e., areas
of .025 in each tail. Figure 9.5 shows this fact, also the fact that for the 1 per cent
level of confidence we are talking about .005 of the area in each tail. The reason why
both tails are included is that with a symmetrical distribution it is as likely for a t of a
certain size to occur in one direction as in the other. In other words, it is not the direction
but the size of t that matters.


TABLE 9.4.—CRITERIA OF SIGNIFICANCE OR CONFIDENCE LEVELS OF t IN A NORMAL
DISTRIBUTION

  Level of t      Level of confidence               Rough conclusion
  Below 1.65      Below .10 or 10% level            Insignificant
  1.65            At the .10 or 10% level           Insignificant
  1.96            At the .05 or 5% level            Significant
  2.33            At the .02 or 2% level            Significant
  2.58            At the .01 or 1% level            Very significant
  2.81            At the .005 or 0.5% level         Very significant
  Above 2.81      Beyond the .005 or 0.5% level     Very significant


By reference to Table 9.4, we note that if an obtained t were equal to
1.80, the discrepancy between hypothesis and fact is said to be significant
between the 10 per cent and 5 per cent levels of confidence. If a t is
2.75 we would say that it is significant between the 1 per cent and 0.5 per
cent levels of confidence. A t of 1.80, or any value below 1.96, however,
is ordinarily regarded as “insignificant.” We would not reject the null
hypothesis unless we could do so at least beyond the .05 level, though
that is an arbitrary custom. A t greater than 1.96 but less than 2.58 is
often reported as being “significant” and a t of 2.58 or greater as being
“very significant.” An investigator may choose any level of confidence
he prefers. But he must defend it. It may depend upon the kind of
problem being investigated and upon the seriousness of being wrong in
concluding either for or against the null hypothesis. At any rate, it is
best practice for the investigator to decide upon the level of significance
he is going to require in advance of knowing any statistical results, lest
he be biased by such knowledge in this decision later.

1 Confidence levels are also called fiducial limits by R. A. Fisher and others.
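
The criteria of Table 9.4 can be checked against the normal curve and applied to any obtained t. The sketch below is one way of doing so, assuming a large-sample (normal) sampling distribution; the function names are illustrative, the two-tailed probability is obtained from the standard library's complementary error function, and the labels follow the rough conclusions of the table.

```python
import math

def two_tailed_probability(t):
    """Chance of a deviation at least this large, in either direction,
    under a normal sampling distribution."""
    return math.erfc(abs(t) / math.sqrt(2.0))

def rough_conclusion(t):
    """Rough verbal conclusion in the manner of Table 9.4."""
    size = abs(t)   # the direction of the deviation does not matter
    if size >= 2.58:
        return "very significant"
    if size >= 1.96:
        return "significant"
    return "insignificant"

for t in (1.65, 1.96, 2.33, 2.58, 2.81):
    print(t, round(two_tailed_probability(t), 3), rough_conclusion(t))

print(rough_conclusion(1.80), "/", rough_conclusion(2.75))
```

Taking the absolute value of t reflects the point that both tails of the curve are counted.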

Errors in Making a Statistical Inference.—There are two chances of
coming to wrong conclusions. One is setting the level of confidence
required so low that there is danger of rejecting the null hypothesis when
it is actually true. The statisticians call this an error of the first kind.
The probability of making this kind of an error is as small as the probability
of a t of the size of the criterion which was adopted occurring by
random sampling. Thus, if t is significant at the 2 per cent level of confidence
and we reject the null hypothesis, the probability that we would
be wrong in doing so is .02, or 2 chances in 100. If, however, t is significant
at the 10 per cent level and we reject the null hypothesis, we
have one chance in 10 of being wrong. An error of the second kind is in
accepting the null hypothesis when it is false. The danger of this error is
increased if we put the criterion too high. There is no easy method of
determining the probability of this kind of error. We can reduce the
chances of it by lowering the level of significance required for rejection.

Another logical point must be emphasized in connection with the t test
of significance. Decision not to reject the null hypothesis does not
necessarily prove that it is true. We have already seen that rejection of it
does not necessarily prove that it is false. We only have degrees of
confidence (no final proof) that we are correct in rejecting it. If an r deviates
from zero so little that the t is below the 5 per cent criterion, we do not
reject the idea that the population correlation could be zero but this is
not the same thing as saying that the correlation is zero. It could be
actually anything within the range marked off by chance deviations as
determined by a standard error. There is even more justification for
saying that the population r is the same as the obtained r, in the absence
of any other information, than for saying that it is actually zero. The
point is that when any investigator obtains a very small r, if he takes it
to mean actual correlation, it is incumbent upon him to show the
improbability of such an r arising from variables uncorrelated in the population.
He can still maintain that the two variables are correlated in his sample,
and he would be right, because the least we can say about an obtained r,
as about any statistic, is that it describes something about a particular
sample. But using a statistic, including r, to describe a population calls
for supporting evidence.

The t ratio will be encountered many times again. It is particularly
useful in testing the significance of differences of many kinds. Its
interpretation in small samples will also receive attention later in this
chapter.


1 See Deemer, W. L. The power of the t test and the estimation of required sample
size. J. educ. Psychol., 1947, 38, 329-342.


Minimum Significant r’s.—A more convenient and practical procedure
for determining whether an obtained coefficient of correlation is
significantly different from zero is provided by the Wallace-Snedecor tables (see
Table D, Appendix B). In the first column of the table are given the
number of degrees of freedom available for the coefficient. In each
correlation problem the number of degrees of freedom is N − 2. Each
observation is a pair of values, one in X and one in Y. One degree of
freedom is considered lost in the computation of each mean, the mean of
X and the mean of Y. Both the products of the moments and the two
standard deviations are affected by the loss of two degrees of freedom.
This leads to some bias in the sample r as an estimate of the parameter r,
incidentally, but it is inconsequential unless N is small.

Having located the proper number of degrees of freedom in Table D,
we find in the second column two values. One is the minimum r that is
significant at the 5 per cent level, and the other, in bold-face type, is the
minimum r significant at the 1 per cent level. If we are satisfied with
these gross criteria for rejection of the null hypothesis regarding
correlation, this procedure will do. If we want greater refinement or other
standards, we would use formula (9.25), or, in the case of small samples,
formula (9.38). One advantage of the use of Table D for this purpose is
that it takes care of small samples as well as large samples. The
minimum r’s listed were derived on the basis of formula (9.38) which will be
discussed under small-sample statistics.

Examination of Table D shows that for samples with 1,000 degrees of
freedom r must be at least .062 to be significant at the 5 per cent level.
An r of .062 or larger, positive or negative, could arise by chance when r is
zero only 5 times in 100. If we reject the idea that the population
correlation is zero, we have 5 chances in 100 of being wrong. For the same
size of sample, an r of .081 is required for significance at the 1 per cent
level. Thus, if we obtained a correlation of .10 (either positive or
negative) we could feel very confident that there is some relationship between
X and Y and that it is in the direction indicated by its algebraic sign.
Since we feel confident that there is some degree of correlation present,
we might then apply the usual formula for $\sigma_r$ (9.24) in order to get an
idea of its probable limits.
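The minimum r's quoted from Table D can be approximated with a short Python sketch. It assumes that the t ratio for a correlation takes the usual form t = r√(N − 2)/√(1 − r²), which is solved here for r; the critical t values are copied from a standard two-tailed t table and are inputs, not computed. The function name `minimum_significant_r` is hypothetical.

```python
import math

def minimum_significant_r(t_crit, df):
    """Smallest |r| whose t ratio reaches t_crit with df = N - 2.

    Assumes t = r * sqrt(df) / sqrt(1 - r**2) and solves for r.
    """
    return t_crit / math.sqrt(t_crit ** 2 + df)

# Two-tailed critical t values (5 per cent, 1 per cent) taken from a
# standard t table; these are assumed inputs.
T_CRIT = {8: (2.306, 3.355), 1000: (1.962, 2.581)}

for df, (t05, t01) in T_CRIT.items():
    print(df,
          round(minimum_significant_r(t05, df), 3),
          round(minimum_significant_r(t01, df), 3))
```

Under these assumptions the sketch gives .062 and .081 for 1,000 degrees of freedom and .632 and .765 for 8 degrees of freedom.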

Thus, even very low coefficients, like .10, may indicate a relationship,
but it takes a very large sample to establish that fact and to determine
its probable value. On the other hand, some obtained coefficients of
moderate or even large size may be very uncertain indicators of any
relationship at all. Note that when N is 10 (8 degrees of freedom), the
minimum r’s required are .632 and .765, at the 5 per cent and 1 per cent
levels, respectively. Even if our obtained r exceeded these limits when
N is 10, the exact level of the amount of correlation would be exceedingly
uncertain. Correlations derived from such small samples are practically
worthless, unless they are of the order of .90 or higher.

Fisher’s z coefficient.—Because of the numerous radical departures of the
sampling distribution of r from normal form, and the limitations to our
interpretations that result from this, R. A. Fisher has developed another
statistic into which any obtained r can be converted by formula and
which has a normal sampling distribution even with very small samples.
This statistic has been called z, which we will write in bold face to
distinguish it from the standard measurement z. They are definitely not
the same statistic. Fig. 9.4 shows distributions of r’s and of
corresponding z’s on their respective scales.

The range of z is from −∞ to +∞, but when r reaches the value .995,
z is still short of the value 3.0. Up to an r of .25, z and r have
approximately the same value. Even when r = .50, z is no larger than .56.
Within these limits, then, distributions of r can be regarded as normal.
Above this range, when normal distribution is an important
consideration, it would be well to convert r to z. This conversion formula is


$$z = \tfrac{1}{2}\left[\log_e (1 + r) - \log_e (1 - r)\right] \tag{9.26}$$
(Conversion of a coefficient of correlation into Fisher’s z)

in which $\log_e$ stands for a logarithm to the base e, or refers to the use
of the Naperian system of logarithms.¹ In terms of logarithms in the
common system,

$$z = 1.1513\left[\log_{10} (1 + r) - \log_{10} (1 - r)\right] \tag{9.27}$$
(Same as formula (9.26) in terms of common logarithms)

For general practice, Table H (Appendix B) may be used for the
conversion of r to z and of z to r. One would not report final results in
terms of z, but would finally convert back to r.

The standard error of z, unlike that for r, is of practically uniform size 
for all values of z. It can be estimated by the formula 


$$\sigma_z = \frac{1}{\sqrt{N - 3}} \tag{9.28}$$
(Standard error of z)


1 For the benefit of the mathematically sophisticated student, z is the hyperbolic arc
tangent of r, or z = tanh⁻¹ r.




The interpretation of any estimate of $\sigma_z$ is like that for any other
standard error. It may be used to mark off confidence limits on the scale of z
which can be referred back to corresponding r’s.
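A brief Python sketch of the conversion and of confidence limits marked off on the z scale follows. It assumes the standard error 1/√(N − 3) and a normal criterion of 1.96 for 5 per cent limits; the function names are hypothetical, and the r of .82 with N of 26 is used purely as an illustrative input.

```python
import math

def r_to_z(r):
    """Fisher's z from r: half the difference of the natural logarithms."""
    return 0.5 * (math.log(1.0 + r) - math.log(1.0 - r))

def z_to_r(z):
    """Convert z back to r (the hyperbolic tangent inverts the transformation)."""
    return math.tanh(z)

def se_z(n):
    """Standard error of z, assumed here to be 1 / sqrt(N - 3)."""
    return 1.0 / math.sqrt(n - 3)

def r_confidence_limits(r, n, criterion=1.96):
    """Mark off limits on the z scale, then refer them back to r."""
    z = r_to_z(r)
    half_width = criterion * se_z(n)
    return z_to_r(z - half_width), z_to_r(z + half_width)

lower, upper = r_confidence_limits(0.82, 26)
print(round(r_to_z(0.50), 3), round(r_to_z(0.995), 3))
print(round(lower, 2), round(upper, 2))
```

Because the limits are set on the z scale and then converted, they are not symmetrical about the obtained r; the upper limit lies closer to r than the lower one does.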

The chief uses of z are to be found in problems of averaging coefficients
of correlation (see Table 13.13) and in testing the significance of
differences between r’s (see p. 224) when r’s are large and sample sizes are not.


THE RELIABILITY OF DIFFERENCES 


Of much more practical value than the standard errors of means, pro- 
portions, and the like are the standard errors of differences between means 
and between proportions and the like. In experimental practice, we are 
perpetually comparing measured results under two conditions that we 
arbitrarily set up. We ask such questions as to whether the eye is more 
sensitive during stimulation of other sense organs or in the absence of 
such stimulation; whether boys or girls are more capable in a test of 
perceptual speed; whether one method of teaching subtraction is superior 
to another in terms of resulting efficiency. This calls for one set of 
measurements under the one condition and another set under the other 
condition and a comparison of means. The statistical question is, “How 
reliable is the difference between means?” 

The Standard Error of a Difference between Uncorrelated Means.— 
Again reliability is indicated by a standard error. The amount of fluctu- 
ation in a difference between sample means is naturally related to the 
amount of fluctuation in the means themselves. The simplest relation- 
ship is given by the formula 


$$\sigma_{d_M} = \sqrt{\sigma_{M_1}^2 + \sigma_{M_2}^2} \tag{9.29}$$
(Standard error of a difference between uncorrelated means)

where $\sigma_{M_1}$ = SE of the mean of the first distribution.
      $\sigma_{M_2}$ = SE of the mean of the second distribution.
This relationship holds only when the two sets of measurements are inde- 
pendent, i.e., uncorrelated. When we are dealing with matched groups, 
for example, particularly when individuals are matched pair by pair, the 
formula will have to be enlarged. But more of that later. 

Let us apply formula (9.29) to a typical problem. A group of 114 men
and a group of 175 women were given the same word-building test in
which the score is the number of words built out of six letters in 5 min.
The results are given in summarized form in Table 9.5. The women’s
mean of 21.0 is 1.3 points higher than that for the men. This mean
difference is very small numerically, but in view of the relatively large
number of cases in the two samples, we should expect the obtained means
to be very close to the true means, and perhaps therefore it indicates a
real sex difference. The stability of each mean is indicated by its SE,
which is .572 in the case of the men and .371 in the case of the women.


TABLE 9.5.—MEANS AND OTHER STATISTICS IN THE COMPARISON OF MEN AND WOMEN
IN A WORD-BUILDING TEST

Statistic            Men        Women

N                    114        175
M                    19.7       21.0
σ                    6.08       4.89
$\sigma_M$           .572       .371

$\sigma_{d_M}$            .682
$D_M$                     1.3
t                         1.91


Just as sample means are distributed normally about the true mean 
when N is large, the sample differences between means are also distributed 
normally. The central tendency about which the differences between 
means fluctuate is a population value. We do not know what that popu- 
lation value is. We are most concerned, first, in determining whether 
there is any difference at all, and second, in determining its approximate 
size. The statistical tests connected with differences, in principle, are 
very much like those we encountered in connection with correlation coef- 
ficients. Since most differences are small (if they were not, we should 
hardly need to make statistical tests) we first make a test to see whether 
we are justified in rejecting the null hypothesis. The null hypothesis in 
this case is the supposition that in the population there is no real
difference. Stated in another, and more acceptable, way, the null hypothesis
is that the two sample means arose by random sampling from the same 
population. Same, that is, with respect to the variable measured; the 
two groups from which the two samples were drawn are obviously differ- 
ent in other respects, otherwise we would not have raised any question 
of a difference at all. 


distribution of differences, with the mean at zero, or at $\bar{M}_1 - \bar{M}_2 = 0.0$.
The deviation of each sample difference,


ence point is equal to $(M_1 - M_2) - (\bar{M}_1 - \bar{M}_2)$, or $M_1 - M_2 - 0$. The


terms of a formula,

$$t = \frac{M_1 - M_2}{\sigma_{d_M}} \tag{9.30}$$
(A t ratio for a difference between means)




The numerator, to be quite complete, should read $M_1 - M_2 - 0$, as was
stated above, but since the zero has no contribution to make to the
computation, it is dropped in ordinary practice. It will help the investigator
using this formula to think more clearly if he remembers that logically
the zero belongs there.

Figure 9.5 shows graphically a sampling distribution of t ratios. This
distribution is real, though rarely derived by using actual data, because
every difference we obtain by random sampling, with N’s constant,
provides its own t value. We could actually take a series of 100 paired
samples, compute $M_1 - M_2$ for each pair, $\sigma_{d_M}$ for each pair, and
consequently a t. The frequency distribution of the 100 t’s we could set up
from those data would look like Fig. 9.5.

[Figure 9.5: curve drawn over a scale of t ratios, negative where $M_1 - M_2$ is
negative and positive where $M_1 - M_2$ is positive; the lower extreme 0.025 of
the expected t’s is shaded at the left and the upper extreme 0.005 of the
expected t’s is shaded at the right.]

FIG. 9.5.—A sampling distribution of t with a mean of 0, which corresponds to a
hypothetical difference between means equal to zero. Shaded areas show the regions of
extreme t’s; at the left those significant at the 5 per cent level and at the right those
significant at the 1 per cent level. Obtained t’s (either positive or negative) in those
extreme regions are interpreted accordingly.

Testing the Null Hypothesis.—For the word-building test we have the
information (see Table 9.5) that the difference $M_1 - M_2$ is −1.3. The
algebraic sign of the difference does not concern us at this time; we are
interested only in its amount. The standard error $\sigma_{d_M}$ = .682. From
this,

$$t = \frac{1.3}{.682} = 1.91$$

The value 1.91 tells us how many $\sigma_{d_M}$’s the obtained difference extends
from the mean of the distribution. The mean, under the null hypothesis
which is being tested, is a difference of zero. Since the sample is large,
we may assume a normal distribution of the t’s and interpret the obtained
t accordingly. It fails by just a little to meet the 5 per cent level of
confidence (which for large samples is 1.96); consequently we would not reject
the null hypothesis and we would say that the obtained difference is not
significant. There may actually be some difference, but we have not
enough assurance of it. There are more than 5 chances in 100 that a
difference as large as this one, or larger, could have happened by
random sampling from the same population, same with respect to
word-building ability. A more practical conclusion would be that we have
insufficient evidence of any sex difference in word-building ability, at
least in the kind of population sampled. Note that the conclusion was
not stated to the effect that we have demonstrated that there is no sex
difference in word-building ability. We cannot prove the truth of the null
hypothesis; we can only demonstrate its improbability.
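The computation for the word-building test can be retraced with a small Python sketch using only the summary statistics of Table 9.5. The divisor N − 1 in the standard error of a mean is an assumption made here because it reproduces the tabled values of .572 and .371; the function and variable names are hypothetical.

```python
import math

def se_mean(sd, n):
    """SE of a mean. The divisor n - 1 is an assumption; it reproduces
    the .572 and .371 reported in Table 9.5."""
    return sd / math.sqrt(n - 1)

def se_diff_uncorrelated(se_1, se_2):
    """Formula (9.29): SE of a difference between uncorrelated means."""
    return math.sqrt(se_1 ** 2 + se_2 ** 2)

# Summary statistics from Table 9.5.
men = {"n": 114, "mean": 19.7, "sd": 6.08}
women = {"n": 175, "mean": 21.0, "sd": 4.89}

se_men = se_mean(men["sd"], men["n"])
se_women = se_mean(women["sd"], women["n"])
se_d = se_diff_uncorrelated(se_men, se_women)

# The hypothesized difference of zero is kept explicit in the numerator.
t = (abs(women["mean"] - men["mean"]) - 0.0) / se_d

print(round(se_men, 3), round(se_women, 3), round(se_d, 3), round(t, 2))
# prints: 0.572 0.371 0.682 1.91
```

Since 1.91 falls short of 1.96, the sketch leads to the same decision as the text: the null hypothesis is not rejected at the 5 per cent level.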

Had the t test turned out very significant, i.e., with less than 1 chance
in 100 that by chance a t could be so large, we would then have been
interested in the size of the difference. Our interest would then have
reverted to the standard error of the difference and the probable limits it
suggested for the size of the difference. This procedure is so similar to
that for determining the probable size of any population parameter that
we need not go through the steps here.

The Standard Error of a Difference in Correlated Data.—When the data 
are so sampled that there is a correlation between the means in the two 
variables measured, i.e., so that the means in pairs of samples tend to rise 
or fall together (positive correlation) or tend to be contrasting so that 
when one rises the other falls (a negative correlation), the SE of a differ- 
ence is estimated by the formula 

$$\sigma_{d_M} = \sqrt{\sigma_{M_1}^2 + \sigma_{M_2}^2 - 2 r_{12}\,\sigma_{M_1}\sigma_{M_2}} \tag{9.31}$$
(Standard error of a difference between correlated means)

which is like formula (9.29) except for the last term, in which $r_{12}$ is the
correlation between the two sets of means.

Fortunately, under the usual circumstances of random sampling, the
correlation between the two sets of means is approximately equal to
the correlation between two sets of single measurements in two samples.
Since we ordinarily have only two samples with two means from which
we could not compute $r_{12}$, this fact is a great convenience. But in order
to compute the correlation between single measurements, we must have
the individual measurements in the two samples paired off two by two
in some manner. For example, if the same group of students takes the
same word-building test twice instead of two different groups taking it,
we have the same individual’s score in the first trial to pair off with his
score in the second trial. Or if, in comparing males and females in the
test, we want to standardize our two groups better by taking a brother
and a sister from each family or if we pair boy with girl with respect to
age, IQ, or social status, or all such factors, then if these factors of
common family, common age, IQ, or social status have any relation to
word-building score, they automatically introduce correlation into the
two samples. We compute a coefficient of correlation in the manner
described in Ch. 8 and introduce it into formula (9.31).

In Table 9.6, we find two sets of knee-jerk measurements, both from
the same 26 men but under two conditions. In the first case (T), the
subjects were squeezing a hand dynamometer just before the stimulus
struck the knee, and in the second case (R) the “relaxed” knee jerk was
obtained under a relaxed, sitting posture. Will the average man show a
real difference in height of knee jerk under the tensed condition, as theory
would lead us to expect? The two means, with a difference of 3.39 deg.,
suggest that the theory is vindicated. But we want to be sure that this
large a difference could not have happened by random sampling from a
population of measurements in which the actual difference is zero.

If we were to assume no correlation between the tensed and normal
measurements of knee jerk, we should apply formula (9.29), or we should
apply formula (9.31) with an $r_{12}$ equal to zero, which is actually the same
thing. Such a $\sigma_{d_M}$ turns out to be 2.37 deg. of arc. The t ratio is
3.39/2.37, or 1.43. This t falls decidedly short of the 5 per cent level of
significance. We should conclude, erroneously, that although there is
some difference in the expected direction, it is not a significant one. So
far as these indications go, we would not be called upon to reject the
null hypothesis; the difference of 3.39 could represent merely a result of
random sampling.

When we compute a coefficient of correlation between the two sets of
measurements, we find it to be +.82. This means that the men came
rather closely in the same rank order in both the tensed and the relaxed
conditions. If a man has a high kick under normal conditions, he will
be likely to have a correspondingly high kick during the tensed conditions.
If a man is low in the one case, he is likely to be low in the other. If the
sampling is random, there would be a similar correlation between means
under the two conditions. If another group of 26 men had a higher
normal average response than this one, it would be likely also to have a
higher average tensed response. When means rise and fall together, they
tend to maintain the same difference between them. In the case of a
perfect positive correlation (r = +1.0), the difference between means
would remain exactly constant. If all the sample differences between
means were identical, their dispersion would be zero, and $\sigma_{d_M}$ would equal
zero. We would then be almost certain of a true difference in the obtained
direction. A correlation of +.82 is less than 1.00, however; so there is
still some room for variability among the differences. But from the line
of reasoning just completed, we can see that the $\sigma_{d_M}$ is going to be smaller
than it turned out to be when we assumed an r equal to zero.


TABLE 9.6.—STRENGTH OF THE PATELLAR REFLEX UNDER TWO CONDITIONS, TENSED
AND RELAXED, FOR 26 MEN, AND DIFFERENCES BETWEEN THEM
(Measurements Are in Terms of Degrees of Arc)

              T          R          T - R
              Tensed     Relaxed    Difference

              31         35          -4
              19         14          +5
              22         19          +3
              26         29          -3
              36         34          +2
              30         26          +4
              29         19         +10
              36         37          -1
              33         27          +6
              34         24         +10
              19         14          +5
              19         19           0
              26         30          -4
              15          7          +8
              18         13          +5
              30         20         +10
              18          1         +17
              30         29          +1
              26         18          +8
              28         21          +7
              22         29          -7
               8          4          +4
              16         11          +5
              21         23          -2
              35         31          +4
              26         31          -5

Σ             653        565        +88
M             25.12      21.73       3.39
σ              7.17       9.45       5.50
$\sigma_M$     1.43       1.89       1.10




By the use of the complete formula (9.31), we find the $\sigma_{d_M}$ to be 1.10,
which is less than half the previous estimate of 2.37. The t ratio is now
3.39/1.10 = 3.06. A t above 3 is obviously in the “very significant”
category.¹

We therefore feel very confident that there is a real difference in favor 
of the tensed conditions. This is not saying that we feel sure that the 
true difference is exactly 3.39; it might be more or less than that. At 
any rate, the hypothesis with which the experiment started receives sub- 
stantial support from the result. 

Observations Should Often Be Paired.—In setting up an experiment with
two groups of subjects or two groups of measurements for statistical
comparison, it is well to pair off cases two by two if possible, so that a
correlation can be computed. Often when such pairing is not actually carried
out, there would still be correlation between means of samples anyway;
the full formula for the SE of a difference cannot then be applied, and
the $\sigma_{d_M}$ by formula (9.29) is overestimated. It is true that under these
circumstances, if the correlation is positive, as is usually the case when
there is correlation, we can say that the correct $\sigma_{d_M}$ is smaller and that
the correct t ratio is larger than the one we estimated. When we have a
significant or very significant t ratio under these circumstances, we can
be sure that the t we would obtain by taking into account the positive
correlation would be even larger. But one difficulty is that when the
t ratio obtained under these circumstances is too small to be significant,
we cannot conclude anything in particular. Least of all can we conclude
that the true difference is probably zero, for had we considered the
correlation, we might have found a significantly large t ratio. The process of
matching and the inclusion of the correlation factor in the $\sigma_{d_M}$ formula are
said to increase the precision of the t test. By this is meant that the test is
more sensitive to a difference when it is real. As a result, we are more
likely to avoid the error of accepting the null hypothesis when it is
incorrect.

In pairing off individuals or observations, it is important that the pair- 
ing be done on some significant basis. It will not pay to do any pairing 
except on the basis of some trait that correlates with the measurements 
on which the two groups are going to be compared. For example, if we 


1 A sample of 26 pairs of observations would be regarded as a “small sample” by
many investigators. In this case, a larger t would be required for meeting the
confidence levels of significance (see Table 9.4). In this problem, with 25 degrees of freedom,
a t of 2.79 is significant at the 1 per cent level (see Table D). The obtained t (3.06)
also exceeds this limit. We have N − 1 degrees of freedom in matched data, where
N is the number of pairs of observations.



were to compare two groups of boys as to ability to do a high jump, one 
group after training of a certain kind and the control group without such 
training, it would be important that the two groups be equated as to age, 
among other things. Ability in the high jump, regardless of training, 
would be dependent upon age, hence correlated with it. But the ability is 
probably not correlated significantly with grade earned in arithmetic; so 
there would be no point in matching the groups on this variable. 

The basis upon which to match groups having been decided, there are 
two common ways of carrying out the matching. One is by pairing cases 
directly. In the problem just mentioned, for every boy of ten years six 
months in the one group, one would seek a boy of like age in the other. 
Small discrepancies may well be permitted at times between pairs. If 
there are about twice as many cases in the one sample as in the other, 
matching two boys to one would be the solution. The other common way 
of matching groups is to ignore individuals as such and simply to try to 
make sure that the two samples have approximately equal means, stand- 
ard deviations, and skewness. When this is done and the two variables 
are correlated, the formula for the standard error is¹

$$\sigma_{d_M} = \sqrt{\left(\sigma_{M_1}^2 + \sigma_{M_2}^2\right)\left(1 - r_{mx}^2\right)} \tag{9.32}$$
(Standard error of a difference for matched samples)


in which $r_{mx}$ is the correlation between $X_m$ (the variable on which the
groups were matched) and X (the variable on which we are testing the
difference). If the groups are matched on the basis of two or more
variables, a multiple correlation coefficient is involved (see Ch. 16).
Comparison of formula (9.11) with formula (9.32) will show that they
are alike. The former corrects $\sigma_M$ for the effects of matching while the
latter corrects simultaneously the two variances of means that enter into
the formula for $\sigma_{d_M}$. This should be sufficient warning that if the
correction has previously been made in each $\sigma_M$, the correction given in formula
(9.32) should not be used; that would be applying the same correction
twice.

A Standard Error of a Difference Obtained Directly from Differences.—
When individuals have been paired off, we can find the desired statistics
directly from differences between pairs. In Table 9.6, we find the
difference in knee-jerk measurements (T − R), given with algebraic signs, for
every individual. If we sum them and divide by N, we obtain the mean
of the differences, which is equal to the difference between the means.
If we calculate the SE of the mean of these differences, we have $\sigma_{d_M}$. The
$\sigma_{d_M}$ is thus obtained in the most direct manner. We do not even need to


1 McNemar, Q. Psychol. Bull., 1940, 37, 331-365.




know the SE’s of the two means or the amount of correlation present,
yet our direct procedure has taken these things into account. The $\sigma_{d_M}$
for the knee-jerk data obtained in this manner is identical with that which
we found previously, as it should be. The interpretations and conclusions
concerning the mean difference are the same as usual. This more direct
method is very strongly recommended whenever it can conveniently be
applied.
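The agreement between the direct method and formula (9.31) can be checked with a Python sketch on the 26 pairs of knee-jerk measurements in Table 9.6. Two assumptions are made here: standard deviations are computed with N in the divisor, and each standard error divides by √(N − 1), which is what reproduces the tabled values. The helper names are hypothetical.

```python
import math

# Knee-jerk measurements from Table 9.6 (degrees of arc), in table order.
tensed = [31, 19, 22, 26, 36, 30, 29, 36, 33, 34, 19, 19, 26,
          15, 18, 30, 18, 30, 26, 28, 22, 8, 16, 21, 35, 26]
relaxed = [35, 14, 19, 29, 34, 26, 19, 37, 27, 24, 14, 19, 30,
           7, 13, 20, 1, 29, 18, 21, 29, 4, 11, 23, 31, 31]

def mean(xs):
    return sum(xs) / len(xs)

def sd(xs):
    """Standard deviation with N in the divisor (an assumption)."""
    m = mean(xs)
    return math.sqrt(sum((x - m) ** 2 for x in xs) / len(xs))

def se_mean(xs):
    """SE of a mean, dividing the standard deviation by sqrt(N - 1)."""
    return sd(xs) / math.sqrt(len(xs) - 1)

def pearson_r(xs, ys):
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)
    return cov / (sd(xs) * sd(ys))

# Direct method: treat the 26 differences as a single sample.
diffs = [t - r for t, r in zip(tensed, relaxed)]
d_mean = mean(diffs)
se_direct = se_mean(diffs)

# Formula (9.31): combine the two SEs and the correlation between pairs.
r12 = pearson_r(tensed, relaxed)
se_t, se_r = se_mean(tensed), se_mean(relaxed)
se_formula = math.sqrt(se_t ** 2 + se_r ** 2 - 2.0 * r12 * se_t * se_r)

t_ratio = d_mean / se_direct
print(round(r12, 2), round(se_direct, 2), round(se_formula, 2), round(t_ratio, 2))
```

The two standard errors agree exactly, apart from floating-point error, because the variance of the differences equals the sum of the two variances less twice the covariance. The resulting t ratio is close to the value given in the text and leads to the same conclusion.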

The Reliability of Differences between Proportions, Frequencies, and
Percentages.—Consider the data in Table 9.7. Here we have the
proportions of 400 men and of 400 women students who judged two words as
“pleasant” or “very pleasant.” The two words were “to explore” and
“symphony.” Here we can raise two questions concerning each word.
Is there any sex difference in the proportion judging the word “pleasant”?
And within each sex, is there a significantly greater proportion of
“pleasant” judgments for one word than for the other? The differences
themselves show that the men favor the word “to explore” slightly more than
do the women, the difference in proportion being .0075. The women
decidedly more often favor the word “symphony,” with an excess of
.2025 over the proportion of the men who judge it pleasant. The men
find the word “to explore” more pleasing than they do the word
“symphony” by a margin of .1925, and the women, on the other hand, find
the word “symphony” more to their liking than “to explore” by a small
margin of .0175. Which of these differences, if any, are significant or
very significant according to the rules we have been following? We can
test any or all of them for statistical significance.


TABLE 9.7.—PROPORTIONS OF 400 MEN AND 400 WOMEN WHO JUDGED THE WORDS
“TO EXPLORE” AND “SYMPHONY” PLEASANT; DIFFERENCES AND STANDARD ERRORS
OF DIFFERENCES; AND t RATIOS

          “to explore”   “symphony”   $r_{12}$   Difference   $\sigma_{d_p}$   t

Men       .8775          .6850        .342       .1925        .0234            8.23
Women     .8700          .8875        .395       .0175        .0180            0.97

The Standard Error of a Difference between Proportions —The standard 


error of a difference between two proportions is given by the formula 


$$\sigma_{d_p} = \sqrt{\sigma_{p_1}^2 + \sigma_{p_2}^2 - 2 r_{12}\,\sigma_{p_1}\sigma_{p_2}} \tag{9.33}$$
(Standard error of difference between proportions)

where $\sigma_{p_1}$ = SE of the first proportion.
      $\sigma_{p_2}$ = SE of the second proportion.
      $r_{12}$ = correlation of proportions in pairs of samples.

Again, it is fortunate for us that, when sampling is random, the
correlation between proportions is equal to the correlation between single cases.
The latter we can estimate from the data. In Table 9.7, we find that
the correlation between men’s judgments of the two words is given as
+.342 and the correlation for the women is +.395, since both words were
judged by the same individuals. But in the comparison between sexes,
there was no pairing of individual judgments in any known way; so we
may assume that the correlations are zero. On this basis we find the
$\sigma_{d_p}$ between men and women for the word “to explore” to be .0235.
The obtained difference of .0075 here yields a t ratio of 0.32, which is
decidedly not significant. The sex difference on the word “symphony”
gives a $\sigma_{d_p}$ of .0281, which yields a t ratio of 7.21. This is so far above the
limit for “very significant” deviations that we are very confident about
its being true that college women (like those in the sample) find
“symphony” more pleasant than do college men (like those in the sample).
Men also decidedly prefer “to explore” to “symphony,” with the highly
significant t value of 8.23. Women, however, who find “symphony”
more pleasing than “to explore” by an excess of .0175, do not give any
sure indication that the true difference is in this direction, for the t ratio is
only 0.97. The results are somewhat in line with what we should expect,
but it can be ventured that some differences that we expected to be true
did not prove to be significant and perhaps do not exist at all; for example,
where we might have expected a difference between sexes on “to explore,”
a significant one failed rather decisively to appear.
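All four comparisons can be retraced with a short Python sketch. It assumes that the standard error of a proportion is √(pq/N) with N = 400, which reproduces the figures quoted above, and it sets the correlation to zero for the comparisons between sexes. The function names and dictionary keys are hypothetical.

```python
import math

def se_proportion(p, n):
    """SE of a proportion, assumed here to be sqrt(p * q / N)."""
    return math.sqrt(p * (1.0 - p) / n)

def se_diff_proportions(se_1, se_2, r12=0.0):
    """Formula (9.33); r12 = 0 when the observations are not paired."""
    return math.sqrt(se_1 ** 2 + se_2 ** 2 - 2.0 * r12 * se_1 * se_2)

def t_for_proportions(p1, p2, n1, n2, r12=0.0):
    """Return the SE of the difference and the t ratio (sign ignored)."""
    se_d = se_diff_proportions(se_proportion(p1, n1),
                               se_proportion(p2, n2), r12)
    return se_d, abs(p1 - p2) / se_d

N = 400
# Proportions and correlations from Table 9.7.
men = {"explore": 0.8775, "symphony": 0.6850, "r12": 0.342}
women = {"explore": 0.8700, "symphony": 0.8875, "r12": 0.395}

comparisons = {
    "sexes, to explore": t_for_proportions(
        men["explore"], women["explore"], N, N),
    "sexes, symphony": t_for_proportions(
        men["symphony"], women["symphony"], N, N),
    "men, between words": t_for_proportions(
        men["explore"], men["symphony"], N, N, men["r12"]),
    "women, between words": t_for_proportions(
        women["explore"], women["symphony"], N, N, women["r12"]),
}

for label, (se_d, t) in comparisons.items():
    print(label, round(se_d, 4), round(t, 2))
```

Only the two comparisons within a sex use a nonzero correlation, since only there were both words judged by the same individuals.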

Differences between Percentages and Frequencies —Similar tests of sig- 
nificance can be made for differences between percentages and frequencies. 
The uses of percentages and frequencies are here completely analogous to 
the use of proportions as they have been in other connections. An illus- 
tration of how to test either of these differences will therefore not be given. 

The Reliability of Differences between Standard Deviations.—If we 
are concerned about differences in variability in two distributions as
measured by σ, we can also make statistical tests of significance somewhat
like the ones already illustrated. The formula for the standard
error of a difference between σ’s is

$$\sigma_{d_\sigma} = \sqrt{\sigma_{\sigma_1}^2 + \sigma_{\sigma_2}^2 - 2r_{12}^2\,\sigma_{\sigma_1}\sigma_{\sigma_2}}$$
(Standard error of a difference between standard deviations) (9.34)


1 This correlation should be derived from samples as a φ coefficient, or the correlation
of two genuinely dichotomous variables (see Ch. 13).




It is especially to be noted that the $r_{12}$ in this equation, unlike its
appearance in others, is squared, for it has been proved that the correlation
between standard deviations in pairs of samples is equal to the square of 
the correlation coefficient between individual pairs of measurements; 
hence the squaring in formula (9.34). 

We may apply this formula to the data in Table 9.5 for the word- 
building test. Here we find the men more variable than the women by a 
difference of 6.08 — 4.89, or 1.19 points. Is this difference significant, or 
could it have arisen as a natural deviation from an actual difference of 
zero, i.e., equality of the sexes in variability? The $\sigma_{d_\sigma}$ proves to be .476
(the correlation being zero) and the t ratio is 1.19/.476, or 2.50. The
difference of 1.19 points therefore just fails to pass the hurdle of
significance at the 1 per cent level. There is just more than one chance in a
hundred that if the two sexes are equally variable in this test, such a large 
discrepancy between their standard deviations could have occurred by 
sampling. Just failing to “pass the hurdle,” however, should not be 
stressed too much. The amount of difference obtained is a very rare 
occurrence and strongly suggests the inference that there is a real sex 
difference in variability in the word-building test. 

Under the heading of small-sample statistics will be found a radically 
different method for testing a difference between two standard deviations. 
With small samples the test given above breaks down completely for lack 
of normal sampling distributions. 

Reliability of Differences between Coefficients of Correlation.—If we 
have two coefficients of correlation, $r_{12}$ and $r_{34}$, that have been obtained
from intercorrelating two pairs of variables and we want to test whether
they could have arisen from the “same population” by random sampling,
by analogy to other formulas, the standard error of a difference between
r’s is estimated by

$$\sigma_{d_r} = \sqrt{\sigma_{r_{12}}^2 + \sigma_{r_{34}}^2 - 2r_{r_{12}r_{34}}\,\sigma_{r_{12}}\sigma_{r_{34}}}$$
(Standard error of the difference between two coefficients of correlation with no common variable) (9.35)

where $\sigma_{r_{12}}$ = the standard error of $r_{12}$.
$\sigma_{r_{34}}$ = the standard error of $r_{34}$.
$r_{r_{12}r_{34}}$ = the correlation between samples of $r_{12}$ and $r_{34}$.

The estimation of the correlation of r’s can be made by means of a very
long formula involving $r_{13}$, $r_{14}$, $r_{23}$, and $r_{24}$, as well as $r_{12}$ and $r_{34}$, which
makes this procedure forbidding. With no variable in common to the
two r’s being compared, it is likely that the r between r’s will be rather
small. When one of the variables in the $r_{12}$ correlation is very highly
correlated with one in the $r_{34}$ correlation, however, the $r_{rr}$ correlation
would probably be of sufficient size to call for its use.

The type of problem in which the average reader will be likely to test 
differences between r’s is one in which one of the variables is common to
the two correlations. This calls for a different correlation of correlations
(see formula 9.36). For this reason the reader is referred elsewhere for the
method of estimating $r_{r_{12}r_{34}}$.* Without using the correlation term $r_{rr}$, one
can sometimes reject the null hypothesis with confidence, because t is
underestimated, but sometimes one could not feel very sure that he should
not reject it if $r_{rr}$ is of substantial size and is not used.

In experimental investigations in which we study the change in correla- 
tion (perhaps reliability or validity) of a measuring instrument under dif- 
ferent conditions, one or both of the correlated variables is likely to enter 
into both correlations. We determine the validity correlation for a test with 
and without scoring weights using the same outside criterion. We com- 
pare the validity coefficients of two similar verbal tests, also against the 
same criterion. For such a situation we would be testing the difference 
between two correlations $r_{12}$ and $r_{13}$, where variable $X_1$ is common to both.
If we substitute $r_{13}$ for the correlation $r_{34}$ in formula (9.35), we can
estimate the standard error $\sigma_{d_r}$ for these two correlations. The correlation of
the r’s would be $r_{r_{12}r_{13}}$. This correlation can be estimated by the formula

$$r_{r_{12}r_{13}} = r_{23} - \frac{r_{12}r_{13}\,(1 - r_{12}^2 - r_{13}^2 - r_{23}^2 + 2r_{12}r_{13}r_{23})}{2(1 - r_{12}^2)(1 - r_{13}^2)}$$
(Correlation between two r’s having one variable in common) (9.36)
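The arithmetic of formula (9.36) can be sketched in a few lines of Python. This is only an illustration: the function name `corr_between_rs` and the three correlations supplied to it are hypothetical and do not come from any table in the text.

```python
def corr_between_rs(r12: float, r13: float, r23: float) -> float:
    """Estimate the correlation between sample values of r12 and r13,
    which share variable 1, following formula (9.36)."""
    bracket = 1 - r12**2 - r13**2 - r23**2 + 2 * r12 * r13 * r23
    correction = (r12 * r13 * bracket) / (2 * (1 - r12**2) * (1 - r13**2))
    return r23 - correction


# Hypothetical values: two tests (2 and 3) validated against one criterion (1).
r_rr = corr_between_rs(r12=0.50, r13=0.40, r23=0.60)
print(f"correlation between the two r's: {r_rr:.3f}")
```

When either validity coefficient is zero the correction term vanishes and the estimate reduces to $r_{23}$, the correlation between the two noncommon variables.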


The z Test of Differences between r’s—Remembering that there are 
doubts about the use of standard errors of r’s when correlations are large 
and when samples are not large, it would be well to consider testing differ- 
ences between z coefficients instead. Unfortunately, no one appears to 
have found a way of estimating correlations between paired samples of 
z’s. We must therefore be limited to problems in which $r_{zz}$ is very small or
zero, as when the two correlations being compared arose from rather
independent variables.

With this limitation, the standard error of a z difference is

$$\sigma_{d_z} = \sqrt{\frac{1}{N_1 - 3} + \frac{1}{N_2 - 3}}$$
(Standard error of a difference between two z coefficients) (9.37)


* Peters, C. C., and Van Voorhis, W. R. Statistical procedures and their mathematical 
bases. New York: McGraw-Hill, 1940. P. 185. 




Consider two r’s, $r_{12}$ = .82 and $r_{13}$ = .92. The corresponding z coefficients
(from Table H) are 1.16 and 1.59. $N_1$ = 50 and $N_2$ = 60, from
which

$$\sigma_{d_z} = \sqrt{\frac{1}{47} + \frac{1}{57}} = .197$$

The t ratio is equal to

$$t = \frac{1.59 - 1.16}{.197} = \frac{.43}{.197} = 2.18$$

From this we would feel more confident than usual that the difference is
significant at the .05 level or better, for had we taken into account a
possible positive correlation between z’s, the t would have been larger.
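The z test just worked by hand can be sketched in Python as a check on the arithmetic. The function name `z_difference_test` is an assumption made for illustration, and `math.atanh` stands in for the lookup in Table H, so the unrounded result (about 2.19) differs slightly from the 2.18 obtained from two-decimal z values.

```python
import math


def z_difference_test(r_a: float, n_a: int, r_b: float, n_b: int):
    """Return (z_a, z_b, standard error, t) for two independent r's,
    using Fisher's z transformation and formula (9.37)."""
    z_a = math.atanh(r_a)  # Fisher's z for the first coefficient
    z_b = math.atanh(r_b)  # Fisher's z for the second coefficient
    se_dz = math.sqrt(1 / (n_a - 3) + 1 / (n_b - 3))
    return z_a, z_b, se_dz, (z_b - z_a) / se_dz


z1, z2, se_dz, t = z_difference_test(0.82, 50, 0.92, 60)
print(f"z1 = {z1:.2f}, z2 = {z2:.2f}, SE = {se_dz:.3f}, t = {t:.2f}")
```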


SMALL-SAMPLE STATISTICS 


The distinction between large-sample and small-sample statistics is not 
an absolute one, by any means, the one realm merging into and overlapping 
so extensively the other. If one asks, “How small is N before we have a 
small sample?” the answers from different sources will vary. There is 
general agreement that the division, if there must be one, is in the range
of 25 to 30. Some place it as low as 20 and others say that anything under
100 is a small sample. The truth of the matter is that the needs for small-
sample considerations increase as N decreases and they may become criti- 
cal somewhere below an N of 30. Sampling distributions depart from the 
normal form more and more as N decreases. This was first realized by 
W. S. Gosset, who published for many years under the mysterious name 
of “Student,” and it was later emphasized by R. A. Fisher, who has 
worked out many of the procedures. 

The Sampling Distribution of t.—For small samples, many statistics
exhibit sampling distributions that depart from normality in various ways, 
as was indicated in connection with discussions of standard errors in earlier 
sections of this chapter. Distributions of correlation coefficients, propor- 
tions, and of standard deviations are often skewed. Another important 
change that affects distributions of differences particularly is a change in 
kurtosis. Kurtosis is apparent in the degree of “peakedness” of the center 
of the distribution. A normal distribution is called mesokurtic, which 
means neither very peaked nor very flat across the top. Curves tending 
toward rectangular form, more or less, are called platykurtic. Those more 




peaked than normal are called leptokurtic. The distribution of t tends to
be leptokurtic. Figure 9.6 shows a leptokurtic distribution compared with
a normal one. The most important thing to notice is not the sharpness
of the center but the fact that the tails of the leptokurtic curve are higher
than for the normal curve. The greater areas under the two tails mean 


Fig. 9.6.—Comparison of a normal distribution with a leptokurtic distribution when
their means and standard deviations are approximately equal. 

that we would have to go out to greater deviations in terms of σ units in
order to include the same proportion of area inside the limits of those
deviations. If we ask how many units one must go from the mean in both
directions to include all except .05 of the area, the answer would be larger
for the leptokurtic than for the mesokurtic distribution.


Fig. 9.7.—Student’s sampling distribution of t for various degrees of freedom. (After
Lewis, D. Quantitative methods in psychology. Iowa City: The author, 1948.) [Axes:
scale of t, horizontal; frequency, vertical.]

The smaller the number of degrees of freedom, the farther the kurtosis 
shifts from the normal form. This is shown by Fig. 9.7. With a very 
large number of degrees of freedom we have the normal curve. With 25 
degrees of freedom the departure from normality is so slight, except very 
near the mean (which does not matter for the t test) and at the extreme
tails, that we usually would not go far wrong in assuming normality.
With 9 and 1 degrees of freedom, however, the t distributions depart
drastically from mesokurtosis. The figure lends support to the choice of 
25 as a lower limit to large-sample logic and practice. 

Confidence Limits in the t Distribution.—Refer again to the high tails of 
the t distribution with small samples, in Fig. 9.7. The t values required
for significance at the .05 and .01, and other levels of confidence have been
calculated. For the .05 and .01 levels the required t’s are given in Table
D, last column. For very large samples, the two t’s are 1.960 and 2.576,
respectively. For a sample of 1,000 df the limits change in the third
decimal place only. For 100 df there is a little change in the second
decimal place. The limits with 100 df are 1.984 and 2.626. Rough limits, by
rounding, of 2.0 and 2.7 would do very well even down to about 30 degrees
of freedom. With only 10 df, however, t’s of 2.23 and 3.17 would be
required for the respective confidence levels. One could, of course, look
up the t’s required for significance to suit the number of df in his particular
investigation. With small samples this becomes imperative, if one is to
make the proper inferences from t tests.
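Where Table D is not at hand, the required t’s can be approximated numerically. The sketch below is one possible approach rather than the method used to build the table: it integrates Student’s density with Simpson’s rule and finds the limit by bisection. The function names, the step count, and the search interval of 0 to 50 are all assumptions chosen for illustration, and the interval is too narrow for very small df at stringent levels.

```python
import math


def t_pdf(t: float, df: int) -> float:
    """Density of Student's t distribution with df degrees of freedom."""
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))
    return math.exp(log_c - (df + 1) / 2 * math.log(1 + t * t / df))


def two_tailed_p(t: float, df: int, steps: int = 2000) -> float:
    """Area in both tails beyond +/- t, by Simpson's rule on [0, t]."""
    h = t / steps
    total = t_pdf(0.0, df) + t_pdf(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 1 - 2 * (total * h / 3)


def critical_t(df: int, alpha: float, lo: float = 0.0, hi: float = 50.0) -> float:
    """Smallest t whose two-tailed probability does not exceed alpha."""
    for _ in range(60):  # bisection on the two-tailed probability
        mid = (lo + hi) / 2
        if two_tailed_p(mid, df) > alpha:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


for df in (10, 100):
    print(df, round(critical_t(df, 0.05), 3), round(critical_t(df, 0.01), 3))
```

The values printed for 10 and 100 df can be compared with the limits quoted in the paragraph above.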

Fisher’s t Formulas.—Fisher has provided several formulas designed for
the computation of t when samples are small. We will first note his t
formula in connection with a coefficient of correlation.

For the Test of a Coefficient of Correlation.—If the population r is zero, 
or may be assumed to be zero as when we assume the null hypothesis for 
the purpose of testing it, the t desired for this test is estimated by the
formula

$$t = r\sqrt{\frac{N - 2}{1 - r^2}}$$
(Fisher’s formula for t in testing a coefficient of correlation) (9.38)

where r = obtained coefficient of correlation
N = number of pairs of observations from which r was computed.


Applying this to an illustrative problem we considered earlier, where
r = .30 and N = 50,

$$t = .30\sqrt{\frac{48}{.91}} = .30\sqrt{52.75} = (.30)(7.26) = 2.18$$

We may therefore regard the obtained correlation as probably not
representing a population correlation of zero, though we can reject the null
hypothesis just beyond the 5 per cent level of confidence. According to
Table D, with the 48 df we have here, the two required t’s are 2.01 and 2.68.
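The same computation can be written as a short Python function. The name `t_for_r` is hypothetical; the inputs are the r = .30 and N = 50 of the illustrative problem, and the degrees of freedom are N − 2.

```python
import math


def t_for_r(r: float, n: int) -> float:
    """Fisher's t for testing an obtained r against a population r of zero."""
    return r * math.sqrt((n - 2) / (1 - r**2))


t = t_for_r(0.30, 50)
df = 50 - 2  # degrees of freedom for this test
print(f"t = {t:.2f} with {df} df")
```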




For the Test of a Difference between Means.—When means are
uncorrelated, the t formula for testing their difference is

$$t = \frac{M_1 - M_2}{\sqrt{\left[\dfrac{\Sigma x_1^2 + \Sigma x_2^2}{N_1 + N_2 - 2}\right]\left[\dfrac{N_1 + N_2}{N_1 N_2}\right]}}$$
(Fisher’s t formula for testing the difference between means) (9.39)

where $M_1$ and $M_2$ are the means in the two samples.
$\Sigma x_1^2$ and $\Sigma x_2^2$ are the sums of squares in the two samples.
$N_1$ and $N_2$ are the numbers of observations, respectively.

The numerator is identical with that in formula (9.30), and as in that 
place, it should read $M_1 - M_2 - 0$, if it were written in full to represent
the deviation that it is. The denominator as a whole is the standard error
of the difference between means, as any t ratio requires, but no doubt it
appears very unfamiliar. In writing the $\sigma_{d_M}$ in this form, Fisher has taken
the null hypothesis quite seriously, as we should do if we are completely 
consistent. That is, if there is but one population there should be but one 
estimate of its variance. The variances in formula (9.29) are allowed to 
differ, as they come from two different samples. But if they came from 
the same population, any difference between them should be due merely to 
sampling errors. The first term under the radical in formula (9.39) is a 
single estimate of the population variance. The numerator of the term 
sums the sums of squares coming from the two samples. This is divided 
by the total number of degrees of freedom, which for this problem is 
$N_1 + N_2 - 2$. The same df could be found by summing
$(N_1 - 1) + (N_2 - 1)$. The use of the second term under the radical has the effect
of computing the variance of the mean of differences from the estimated 
population variance. 
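A brief Python sketch may help to show how the single pooled estimate of variance enters formula (9.39). The function name `pooled_t` and the two small sets of scores are hypothetical, invented only to keep the arithmetic easy to follow.

```python
import math


def pooled_t(sample1, sample2):
    """Return (t, df) for two uncorrelated samples, following formula (9.39)."""
    n1, n2 = len(sample1), len(sample2)
    m1 = sum(sample1) / n1
    m2 = sum(sample2) / n2
    ss1 = sum((x - m1) ** 2 for x in sample1)  # sum of squares, sample 1
    ss2 = sum((x - m2) ** 2 for x in sample2)  # sum of squares, sample 2
    df = n1 + n2 - 2
    pooled_variance = (ss1 + ss2) / df  # one estimate of population variance
    se_difference = math.sqrt(pooled_variance * (n1 + n2) / (n1 * n2))
    return (m1 - m2) / se_difference, df


# Hypothetical scores for two independent groups.
group_a = [12, 15, 11, 14, 13]
group_b = [9, 10, 8, 11, 10, 12]
t, df = pooled_t(group_a, group_b)
print(f"t = {t:.2f} with {df} df")
```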


When the two samples are of equal size, i.e., $N_1 = N_2$, formula (9.39)
simplifies to

$$t = \frac{M_1 - M_2}{\sqrt{\dfrac{\Sigma x_1^2 + \Sigma x_2^2}{N_i(N_i - 1)}}}$$
(t ratio for difference between means in two samples of equal size) (9.40)

where $N_i$ = size of either sample.

When means of paired samples are not independent but correlated, the
best formula to use for deriving t directly from sums of squares is

$$t = \frac{M_d}{\sqrt{\dfrac{\Sigma x_d^2}{N(N - 1)}}}$$
(The t for differences between correlated pairs of means) (9.41)

where $M_d$ = mean of the N differences of paired observations.
$x_d$ = deviation of a difference from the mean of the differences.

The procedure implied by this formula was actually applied earlier in 
connection with the knee-jerk data under two experimental conditions 
(see Table 9.6). The number of degrees of freedom to use in this case is 
N — 1, where N is the number of pairs of observations. For the knee-jerk 
problem there are 25 degrees of freedom, which indicate t’s of 2.06 and
2.79 for the .05 and .01 levels, respectively. 
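The procedure for correlated means can be sketched in Python as follows. The paired scores are hypothetical (they are not the knee-jerk data of Table 9.6), and the function name `paired_t` is an assumption made for illustration.

```python
import math


def paired_t(first, second):
    """Return (t, df) from paired observations, working from the differences
    and their sum of squares; df is the number of pairs minus one."""
    diffs = [b - a for a, b in zip(first, second)]
    n = len(diffs)
    mean_diff = sum(diffs) / n
    ss_diff = sum((d - mean_diff) ** 2 for d in diffs)
    se_mean_diff = math.sqrt(ss_diff / (n * (n - 1)))
    return mean_diff / se_mean_diff, n - 1


# Hypothetical scores for the same five subjects under two conditions.
condition_1 = [10, 12, 9, 14, 11]
condition_2 = [12, 15, 10, 15, 14]
t, df = paired_t(condition_1, condition_2)
print(f"t = {t:.2f} with {df} df")
```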

Differences between Means in Nonnormal Distributions—If there is good 
reason to believe that the population distribution is not normal but 
seriously skewed or bimodal, and especially if the samples are small, the 
usual t test does not apply. For such a situation, methods developed by
Festinger, and others, are probably the most suitable substitutes.1 Such
problems are not sufficiently common to justify explaining those methods
here.

For the Test of a Difference between Uncorrelated Proportions—When the 
null hypothesis is assumed with regard to two observed proportions, 
Fisher recommends, again, that we use just one estimate of the population 
variance. This requires the use of a weighted mean of the two sample 
proportions. Formula (4.8), previously given, can be applied here. It is
repeated here to apply to the averaging of two proportions.

$$\bar{p}_e = \frac{N_1 p_1 + N_2 p_2}{N_1 + N_2}$$
(A weighted mean of two sample proportions used to estimate a population proportion) (9.42)

The formula for t is

$$t = \frac{p_1 - p_2}{\sqrt{\bar{p}_e \bar{q}_e \left[\dfrac{N_1 + N_2}{N_1 N_2}\right]}}$$
(A t ratio for a difference between uncorrelated proportions) (9.43)

where $\bar{q}_e = 1 - \bar{p}_e$.
When the two samples are of equal size, i.e., $N_1 = N_2$, if we let both
equal $N_i$, formula (9.43) simplifies to

$$t = \frac{p_1 - p_2}{\sqrt{\dfrac{2\bar{p}_e \bar{q}_e}{N_i}}} \qquad (9.44)$$

Since this formula is proposed as a “small-sample” device, the question 
may arise as to just how small a sample will be suitable for the application 


1 Festinger, L. The significance of difference between means without reference to 
the frequency distribution function. Psychom., 1946, 11, 97-105. 




of the formula. Statisticians seem to have very little to say on this point 
or on the question of degrees of freedom in connection with this particular 
t test. It would seem to the author that we should be cautious about
applying formula (9.43) to very small samples for the same reason, as was
earlier stated, that the use of $\sigma_p$ is of dubious validity when a small sample
is combined with an extreme proportion. Particularly to be avoided
is the application of this formula when $p_1$ or $p_2$ is 0.0 or 1.0.
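The test for uncorrelated proportions can be sketched in Python in the same spirit. The proportions and sample sizes below are hypothetical, and the function name `proportions_t` is an assumption. The guard against proportions of 0.0 or 1.0 reflects the caution stated above; it is not part of the formula.

```python
import math


def proportions_t(p1: float, n1: int, p2: float, n2: int):
    """Return (weighted mean proportion, t) for two uncorrelated proportions,
    using one pooled estimate of the population proportion."""
    if p1 in (0.0, 1.0) or p2 in (0.0, 1.0):
        raise ValueError("formula should not be applied when a proportion is 0 or 1")
    p_e = (n1 * p1 + n2 * p2) / (n1 + n2)  # weighted mean of the two proportions
    q_e = 1 - p_e
    se_difference = math.sqrt(p_e * q_e * (n1 + n2) / (n1 * n2))
    return p_e, (p1 - p2) / se_difference


# Hypothetical samples: 40 per cent of 50 cases versus 60 per cent of 75 cases.
p_e, t = proportions_t(0.40, 50, 0.60, 75)
print(f"pooled p = {p_e:.2f}, t = {t:.2f}")
```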

Differences between Correlated Proportions.—While formula (9.33) is 
general enough to take care of testing the significance of differences 
between proportions (also percentages and frequencies), in most instances 
when data are correlated there is a more economical procedure recently 
introduced by McNemar.1 As stated in an earlier footnote, the correlation
required by formula (9.33) is the φ coefficient or the correlation coefficient
with two categories in both X and Y. McNemar’s formula avoids
the necessity for computing the standard errors of the proportions as well
as the phi coefficient of correlation, but does require having the data in a
form that can be used for computing φ.

For a genuine nonzero correlation to exist between the two samples, as 
usual, either the same individuals or objects must appear in both or there 
must be a pairing in some significant manner, as of twins, siblings, or 
experimental-control pairs. Suppose we have administered two test 
items to a sample of 100 students. Item I is answered correctly by 60 
of the group and item II by 70. Is item II actually easier than item I? 
In making the t test to answer this question, we must definitely face the
possibility of correlation between the two items and consequently between 
the two proportions. To handle this problem properly, we need to set 
up the data in the form of a four-cell contingency table, as in Table 9.8. 
At the left are the four frequencies of those who were correct on item I and 
either correct or incorrect on item II, and the frequencies of those who 
were incorrect on item I and either correct or incorrect on item II. At the 
right in Table 9.8 are given letter symbols to stand for the four categories. 


Using these symbols, McNemar’s formula, in modified form, reads

$$t = \frac{b - c}{\sqrt{b + c}}$$
(t ratio for difference between correlated proportions) (9.45)


It will help to assure the proper application of this formula to note that 
the symbols b and c stand for the discordant cases in the four-cell table; 
in this problem b and c stand for individuals who succeed in one item and 


1 McNemar, Q. Note on the sampling error of the difference between correlated
proportions or percentages. Psychom., 1947, 12, 153-157.




TABLE 9.8.—A FOUR-CELL CONTINGENCY TABLE OF FREQUENCIES OF STUDENTS WHO
PASSED OR FAILED EACH OF TWO TEST ITEMS

                 Frequency Table              Symbolic Table
                     Item II                      Item II
                Fail   Pass   Both           Fail   Pass   Both
Item I   Pass      5     55     60    Pass     b      a     a+b
         Fail     25     15     40    Fail     d      c     c+d
         Both     30     70    100    Both    b+d    a+c     N


fail in the other. It will also help to know that the difference $b - c$
divided by N equals the difference between $p_1$ and $p_2$. It is therefore the
difference between two obtained frequencies, i.e., $b - c = Np_1 - Np_2$.
To find the difference that is being tested in the numerator of the t ratio
is not a new experience. The denominator, therefore, must somehow 
represent the standard error of a difference between frequencies (it would 
be N times the standard error of a difference between proportions) with 
the correlation taken into account. In this formula, too, there is implied 
but one estimate of the population variance and it is derived from an 


average of the sample proportions. 
Solving formula (9.45) as applied to the test-item data, we have 


$$t = \frac{5 - 15}{\sqrt{5 + 15}} = \frac{-10}{\sqrt{20}} = -2.24$$


The difference we would infer to be significant between the .05 and .01 
levels. Item II is probably easier than item I. 
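The solution above can be retraced in Python. The cell frequencies used here are the ones implied by the totals given in the text (60 passing item I, 70 passing item II, 100 students in all) together with 55 students passing both items; they are an assumption wherever the printed table is unclear. The function name `correlated_proportions_t` is likewise hypothetical, and the check on b + c follows the restriction the text attaches to formula (9.45).

```python
import math


def correlated_proportions_t(b: int, c: int) -> float:
    """t ratio for correlated proportions from the two discordant cells."""
    if b + c < 10:
        raise ValueError("b + c should be 10 or greater for this formula")
    return (b - c) / math.sqrt(b + c)


# a: pass both items; b: pass I, fail II; c: fail I, pass II; d: fail both.
a, b, c, d = 55, 5, 15, 25
n = a + b + c + d
p1 = (a + b) / n  # proportion passing item I
p2 = (a + c) / n  # proportion passing item II
t = correlated_proportions_t(b, c)
print(f"p1 - p2 = {p1 - p2:.2f}, (b - c)/N = {(b - c) / n:.2f}, t = {t:.2f}")
```

Only the discordant cells b and c enter the ratio; the students who passed both items or failed both contribute nothing to the difference being tested.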
It is informing to see what the outcome would have been if we had 


applied formula (9.44), without taking into account the amount of inter- 


correlation. With $\bar{p}_e$ estimated to be .65,

$$t = \frac{-.10}{\sqrt{\dfrac{2(.65)(.35)}{100}}} = \frac{-.10}{.0675} = -1.48$$




From this result we would have concluded that the difference was 
insignificant. This demonstrates how a decision may be altered dras- 
tically when the correlation term in the standard-error formula is taken 
into account. Without it, we run the risk of making an error of the 
second kind; of not rejecting the null hypothesis when it is false. The 
correlation (φ coefficient) between the two items amounts to +.58.
The reader will find that if he lets $\sigma_{p_1}^2 = \sigma_{p_2}^2 = \sigma_{p_1}\sigma_{p_2} = .002275$, and
substitutes these with the correlation of +0.58 in formula (9.33) he will
come out with a $\sigma_{d_p}$ equal to .0439, which gives a t of 2.29, which is near
that obtained with McNemar’s formula (2.24).

One restriction in the application of formula (9.45) is that $b + c$ should
be 10 or greater. 

The F Test of Differences between Standard Deviations.—For small 
samples, the t test of differences between standard deviations is not
satisfactory, even with the availability of Student’s distribution for t. Instead
of testing the significance of a difference between two σ’s, we can test the
significance of the ratio of the two variances that correspond to them. If 
we compute the ratio of the larger of two variances to the smaller of the 
two, the larger the difference, the further the ratio exceeds 1.00. The 
ratio is 1.00 when the two variances are equal. If the ratio of the variances 
is significant, the difference between the standard deviations is significant. 

More accurately stated, we do not find the ratio of the variances in the 
two samples. Instead, we find an estimate of the population variance 
from each of the two random samples and from these values compute the 
ratio. We assume the null hypothesis, that the two samples came from 
the same population, and we ask whether two estimates of that population 
variance could differ as much as the obtained ratio indicates. The ratio 
has been given the symbol F, and is computed from the formula 


$$F = \frac{\text{larger variance}}{\text{smaller variance}}$$
(F ratio for testing a difference between two estimates of a population $\sigma^2$) (9.46)


Each of these estimated variances is computed by the usual method: sum 
of squares in the sample divided by the number of degrees of freedom. 
This application of the F test rests upon the assumption that the popula- 
tion is normally distributed. 

A small set of data will illustrate the operation of this procedure. 
Assume that two sets of scores, in one of which $N_1$ = 8 and in the other of
which $N_2$ = 5, have sums of squares $\Sigma x_1^2$ = 132 and $\Sigma x_2^2$ = 26. The
degrees of freedom are 7 and 4, respectively, so the estimated variances 
of the population, independently derived, are 18.86 and 6.5. The F ratio 
is 18.86/6.5, which equals 2.90. 
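The same F ratio can be obtained with a few lines of Python. The function name `f_ratio` is hypothetical; the sums of squares and sample sizes are those of the small illustration just given. The function puts the larger estimated variance in the numerator and reports the degrees of freedom in that order.

```python
def f_ratio(ss1: float, n1: int, ss2: float, n2: int):
    """Return (F, df of larger variance, df of smaller variance)."""
    var1 = ss1 / (n1 - 1)  # estimate of population variance from sample 1
    var2 = ss2 / (n2 - 1)  # estimate of population variance from sample 2
    if var1 >= var2:
        return var1 / var2, n1 - 1, n2 - 1
    return var2 / var1, n2 - 1, n1 - 1


F, df_larger, df_smaller = f_ratio(132, 8, 26, 5)
print(f"F = {F:.2f} with {df_larger} and {df_smaller} df")
```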




The Distribution of F.—In random sampling, the distribution of F ratios
can be predicted from the mathematical relationships. Figure 9.8 
represents three distributions for the situations with certain combinations 
of degrees of freedom, all of them being very small samples. Especially 
to be noted is the marked skewness of the curves. The test of an F ratio 
is made only in the tail at the right, since all the ratios examined for 
significance are in that region. The probability of an F’s exceeding a 
certain value by chance is given by the area under the tail beyond that 
F value. 

Table F (Appendix B) gives the standard F limits that are significant at 
the .05 and .01 levels of confidence when there are different combinations 
of degrees of freedom in connection with each of the two variances in the 
ratio. For the problem above, the two degrees of freedom are 7 and 4, 


Fig. 9.8.—Sampling distribution of Snedecor’s F for various combinations of degrees
of freedom. (After Lewis, D. Quantitative methods in psychology. Iowa City: The
author, 1948.) [Axes: scale of F, 0 to 4.0, horizontal; frequency, vertical.]

respectively, for the larger and smaller (or numerator and denominator) 
variances. Looking into the appropriate column and row of Table F, we 
find that the two F’s for the two significance levels are 6.09 and 14.98, 
respectively. The obtained F does not even approach the former of these 
very closely. We therefore do not reject the null hypothesis and decide 
that so far as variance or variability is concerned the two samples could 
well have come from the same population. 

In the following chapter we shall see the F test extended considerably to 
the problems of analysis of variance. It is in that connection that the 
F test justifies the recognition that it deserves. The application demon- 
strated here is only one of many. 

Sequential Analysis.—There has been developed very recently a pro- 
cedure that enables the investigator to save considerable time and effort 
by testing for significance as he samples. Large differences are likely to 


prove significant with rather small samples. It would be wasteful of 




experimental effort to accumulate more cases than would be needed to give 
a very significant ¢ or F. When we have no advance information as to 
how large a difference is going to be, we do not know how large a sample 
will be needed. We could obtain a small sample, test the difference, and, 
if it proved significant, stop the experiment. If it did not prove signifi- 
cant, we would continue to add observations sampled in the same manner, 
then make another test, and so on. Eventually, the test goes in the
direction of one hypothesis or another. This principle is applied in the method
known as sequential analysis. There is insufficient space to describe the 


method adequately here. The reader is referred to an original source on 
the subject.1


Exercises 


Data 9A.—RESULTS FROM A TEST OF THE ABILITY TO NAME FACIAL EXPRESSIONS IN 
THE RUCKMICK PHOTOGRAPHS 


Statistic Men Women 
N 95 164 
M 21.1 22.0 
σ 3.62 3.15
Q 2.38 2.16 
Mdn 21.5 22.2 


Data 9B.—QUANTITY WRITTEN IN SENTENCE CONSTRUCTION FROM 10 SETS OF 3
NOUNS EACH AND 10 SETS OF 3 VERBS EACH
Measurement Is the Number of Sentences Written in a Limited Time.
Subjects Were 55 Girls

Statistic Nouns Verbs
M 24.7 22.8
σ 5.42
$r_{nv}$ = .67

1. Compute the standard errors of the means for Data 9A, and interpret your
results.

2. Compute the standard errors of the means for Data 9B, and interpret your
results.

3. Compute the standard errors of the medians for Data 9A, and interpret your
results.

4. Compute the standard errors of the standard deviations in either Data 9A or
Data 9B, and interpret your results.


1 Wald, A. Sequential analysis. New York: Wiley, 1947. 




Data 9C.—NUMBER OF STUDENTS IN TWO GROUPS WHO PASSED EACH OF THREE
ITEMS IN AN INTRODUCTORY PSYCHOLOGY EXAMINATION

          Group I   Group II
N            37        63
Item A       24        26
Item B       33        32
Item C       30        44

$r_{ab}$ = .19
$r_{ac}$ = .32
$r_{bc}$ = .25


5. Compute the standard errors of the frequencies of passing students in Data 9C,
and interpret your results. Do the same in terms of percentages and proportions.

6. Compute the standard error of the difference in means for Data 9A and also
for Data 9B, and test for significance. State interpretations.

7. Compute the standard error of the difference between medians in Data 9A, and
interpret your results.

8. Determine the reliability of the differences between standard deviations in 
Data 9A and 9B. Draw conclusions. 

9. Determine the reliability of differences between Groups I and II, Data 9C, in
terms of frequencies, percentages, or proportions of correct responses. Interpret your 
results. 

10. Determine the reliability of the differences between proportions passing items 
A, B, and C for either Group I or Group II. Give your interpretations. 

11. Assume that Data 9A are in a stratified-random sample. Compute the SE of 
the mean for a combined sample on the basis of this assumption. Compute the SD of 
such a combined sample, using formula (5.20). Compute the SE of the mean from 
this SD and compare it with the other. Explain the difference. 

12. Assume that the same 55 girls as in Data 9B repeated the same test with the 
following means: 26.1 and 23.5, for nouns and verbs, respectively. The two SD’s were
5.12 and 5.04, respectively. The corresponding reliability coefficients (test-retest) 
were .87 and .75. What are the best estimates of the SE’s of the means in the second 
samples? 

13. Was there a significant gain in either the noun score or the verb score? Support 
your answer with evidence, reporting the major steps you took. 

14. The correlation between an interest score and degree of satisfaction in a certain
vocational assignment was .33 in a sample of 102. Find $\sigma_r$, $\sigma_z$, and the t ratio for this
finding. Interpret your results.

15. Apply Fisher’s t formula for a $\sigma_{d_M}$ to the following data:

$N_1$ = 11, $N_2$ = 26, $M_1$ = 17.5, $M_2$ = 14.8, $\Sigma x_1^2$ = 44, and $\Sigma x_2^2$ = 65.

Interpret your results.

16. Test the SD’s in Exercise 15 for significance of their difference by making an F
test. Interpret your results.


CHAPTER 10 
INTRODUCTION TO ANALYSIS OF VARIANCE 


It frequently happens in psychological and educational research that we 
obtain more than two sets of measurements, each under its own set of con- 
ditions, and we want some indication as to whether there are significant 
differences among the sets. We could, of course, pair off two sets at a 
time, pairing each one with every other one, and test the reliability of the 
difference in each pair. The practical difficulty in this approach lies in the 
number of pairs to be examined when there are, let us say, 5 or more sets. 
Five sets mean 10 pairs; 6 sets mean 15 pairs; and 10 sets mean 45 pairs. 
There is always the possibility that none of the differences would prove 
significant. What we desire in meeting this situation is some procedure 
by which we can say in advance whether or not there are any significant 
differences. If the answer to such a preliminary survey is “Yes,” we can 
then examine pairs to see just where significant differences exist. If the
answer is “No,” our search is over without further ado. 

The methods of R. A. Fisher, known as analysis of variance, are well 
designed to meet this kind of problem as well as other problems. The real 
problem here is to determine whether sets of data obtained under varying 
conditions are sufficiently homogeneous to be regarded as belonging to the 
same population. Whether or not we combine distributions into larger 
composite distributions sometimes hinges on the answer to this question. 
Fisher’s test of significance in connection with his analysis of variance is 
designed precisely to tell us whether sets of data are sufficiently different 
from one another for us to reject the hypothesis that they arose by random 
sampling from the same population. 


ANALYSIS IN A ONE-WAY CLASSIFICATION PROBLEM 


Total Variance in a Composite Sample.—At this time it may be profita- 
ble for the reader to review the topic of “standard deviation in a com- 
posite sample” treated in Ch. 5. At that place it was shown how the sum 
of squares of a composite distribution which is made up of a combination 
of several sets of measurements, or subsamples, is equal to the summation 
of sums of squares within subsamples plus the sum of squares of the
subsample means around the composite mean. In terms of an equation
(formula 5.17), which is repeated here,

$$\Sigma x_t^2 = \Sigma x_a^2 + \Sigma x_b^2 + n_a d_a^2 + n_b d_b^2 \tag{10.1}$$


This equation is limited to the combination of two subsamples, A and B. 
It could be extended to include any number of subsamples, with a pair of 
terms like those for A and B added for each additional set of data. The 
first two terms on the right-hand side of the equation represent sums of 
squares within the subsamples. The deviations within each subsample 
are from the mean of that subsample, in this case, from means $M_a$ and $M_b$,
respectively. The symbols $d_a$ and $d_b$ stand for deviations of the set means
($M_a$ and $M_b$) from the composite mean ($M_t$). Each $d^2$ is multiplied by the
number of cases in its set, because there are as many deviations of the size
of $d$ in each set as there are cases. They represent variations between
samples (indirectly through variation from a common mean $M_t$). It is
as if each deviation $X - M_t$ were made up of two components, $x_a + d_a$
in the one set and $x_b + d_b$ in the other set. The total sum of squares is
likewise made up of two components: that from deviations within sets and
that from deviations between sets and the common reference point, $M_t$.
Just as we have separated the sums of squares into two distinct sources we 
can also separate the variances into the same sources. This is one of the 
fundamental concepts in analysis of variance. 
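The partition in formula (10.1) can be checked numerically. The sketch below borrows Sets I and II of Table 10.1 as subsamples A and B purely for illustration; the helper names are assumptions, not the author's notation.

```python
def mean(values):
    return sum(values) / len(values)

def sum_sq_about(values, center):
    """Sum of squared deviations of the values from a given reference point."""
    return sum((x - center) ** 2 for x in values)

set_a = [114, 115, 111, 110, 112]   # Set I of Table 10.1, used as subsample A
set_b = [119, 120, 119, 116, 116]   # Set II of Table 10.1, used as subsample B

m_a, m_b = mean(set_a), mean(set_b)
m_t = mean(set_a + set_b)           # composite mean

total_ss = sum_sq_about(set_a + set_b, m_t)
within_ss = sum_sq_about(set_a, m_a) + sum_sq_about(set_b, m_b)
between_ss = len(set_a) * (m_a - m_t) ** 2 + len(set_b) * (m_b - m_t) ** 2

# The total should equal the within part plus the between part.
print(round(total_ss, 2), round(within_ss, 2), round(between_ss, 2))
```

The same check extends to any number of subsamples by adding one within term and one between term per set.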

Two Estimations of Population Variance.—While this illustration of the 
segregation of sources of variance into two components—that within sets 
and that between sets—is a useful basis for approaching the analysis-of- 
variance problem, we must hasten to take the next step. In the preceding 
chapter the point was repeatedly stressed that in testing significances of 
difference we had to think in terms of population variances rather than 
sample variances. We must get back to the general objective of analysis- 
of-variance procedures, namely, to test for significant differences between 
several sets of independently derived experimental samples to see whether 
they could or could not have arisen by random sampling from the same 
population. 

The next key principle in this connection is that we make two distinct 
estimates of the population variance, one derived from the within sum of 
squares and the other from the between sum of squares. If these two esti- 
mates are very similar we are inclined to accept the null hypothesis, that 
the sets of measurements did arise from the same population. If the two 
estimates differ sufficiently, i.e., to the extent that random sampling cannot
reasonably account for them, we reject the null hypothesis. The testing
of the two estimates of variance follows the ratio method described
for the comparison of variances in the preceding chapter. The criterion is
the F-ratio test. Except in rare instances the “between” variance is
greater than the “within” variance, but even when it is smaller the prac- 
tice is to define F as the ratio of the between variance to the within 
variance. 

Since we want estimates of population variance, to avoid biases we 
divide sums of squares by the degrees of freedom rather than by N. 
Remembering that one degree of freedom is lost in computing each mean, 
let us consider how many are left for the within and the between variances. 
Let us assume that we have k sets of n observations each. It is very
convenient, though not essential, to have the same number of observations
in each set and most experiments in which analysis of variance is to be
applied are designed with that in mind. Within each set, from which the
within sum of squares is derived, there is one mean, so that we lose
altogether k degrees of freedom. We could write the number remaining as
$k(n - 1)$ or as $N - k$. If there were 10 sets with 8 observations each, we would have
80 — 10 = 70 degrees of freedom, or 10(8 — 1) = 70. The between 
variance is estimated from the k means, which may be regarded as k 
independent observations. The mean of the composite is a mean of these 
k means and one degree is lost in this manner. This leaves k — 1 degrees 
of freedom for the between variance. For the 10 sets of 8 observations 
each we would have 9 degrees of freedom for the between variance.
Combining the degrees of freedom for the two variances, within and between,
we have 79. This checks with the number we would have if we combined
all the 80 observations in one set and computed one estimate of variance.
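The bookkeeping of degrees of freedom for k sets of n observations can be sketched as follows. The function name and return order are illustrative assumptions.

```python
def one_way_df(k: int, n: int):
    """Return (between, within, total) degrees of freedom for k sets of n observations."""
    between = k - 1          # k set means, one degree lost to the composite mean
    within = k * (n - 1)     # one degree lost to the mean of each set
    total = k * n - 1        # all N observations treated as one set
    return between, within, total

between, within, total = one_way_df(k=10, n=8)
print(between, within, total, between + within == total)
```

For the example in the text (10 sets of 8 observations) the between and within figures add up to the total, as the paragraph states.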

Estimation of the Within and Between Variances.—Having determined 
how to find the sums of squares and also the degrees of freedom for the two 
estimates of population variance, we are ready for the formulas. They are 


$$\text{Between variance} = \frac{\Sigma n_s d_s^2}{k - 1}$$

$$\text{Within variance} = \frac{\Sigma n_s \sigma_s^2}{k(n - 1)} = \frac{\Sigma n_s \sigma_s^2}{N - k}$$


The expression $n\sigma_s^2$ is equal to $\Sigma x_s^2$, as was said before. We may
therefore substitute $\Sigma x_s^2$ in the last equation. And since in most practical
applications of analysis of variance the sets have equal n's, we may write
the two equations


¹ The temporary quotation marks for “between” variance and “within” variance
here are merely for the purpose of calling attention to the fact that we are shifting
the meaning of those terms somewhat. Where they referred to sample variances before
they hereafter stand for estimates of population variances, unless sample variance is 
specified. “Between” and “within” sums of squares will continue to refer to samples. 




$$\text{Between variance} = \frac{n\Sigma d^2}{k - 1} \tag{10.2}$$

$$\text{Within variance} = \frac{\Sigma x_s^2}{k(n - 1)} = \frac{\Sigma x_s^2}{N - k} \tag{10.3}$$


The subscript may now be dropped from n, since it is a constant throughout,
and also from the d's, but the subscript s is left on the $x^2$ to indicate
that we are here dealing with the deviations from the means of the sets 
rather than from the grand mean of the composite. 
The Solution of an Analysis-of-variance Problem.—In Table 10.1, we
have four sets of observations made by the same individual on the Galton
bar. With a constant horizontal line of 115 mm., the subject adjusted
another line to seem equal to it. The four sets were obtained under four
different arrangements of conditions under which the adjustments were
made. Is it likely that the observations all came by random sampling
from the same general “population” of adjustments, or were there
systematic differences among sets sufficient to say that the data are really
not homogeneous?

TABLE 10.1.—WORK SHEET FOR THE ANALYSIS OF VARIANCE IN FOUR SETS OF
MEASUREMENTS ON THE GALTON BAR

                 Set I    Set II   Set III   Set IV

The Measurements (X)
                  114      119       112      117
                  115      120       116      117
                  111      119       116      114
                  110      116       115      112
                  112      116       112      117
  ΣX_s            562      590       571      577      2,300 = ΣX
  M_s           112.4    118.0     114.2    115.4      115.0 = M_t

Deviations within Sets (x_s)
                 +1.6     +1.0      -2.2     +1.6
                 +2.6     +2.0      +1.8     +1.6
                 -1.4     +1.0      +1.8     -1.4
                 -2.4     -2.0      +0.8     -3.4
                 -0.4     -2.0      -2.2     +1.6

Squares of Deviations within Sets (x_s²)
                 2.56     1.00      4.84     2.56
                 6.76     4.00      3.24     2.56
                 1.96     1.00      3.24     1.96
                 5.76     4.00      0.64    11.56
                 0.16     4.00      4.84     2.56
  Σx_s²         17.20    14.00     16.80    21.20      69.20 = Σx_s²

Deviations of Set Means from Grand Mean (d)
  d              -2.6     +3.0      -0.8     +0.4
  d²             6.76     9.00      0.64     0.16      16.56 = Σd²
  nd²           33.80    45.00      3.20     0.80      82.80 = nΣd²

The following steps are followed in the solution of a problem of the
type in Table 10.1:


Step 1. Compute sums and means of the sets; also the grand total $\Sigma X$ and
the grand mean $M_t$.

Step 2. For every set, compute the deviations from the set mean $M_s$.
These are equal to $(X - M_s)$.

Step 3. Square the deviations within sets to find each $x_s^2$. Sum these to
obtain $\Sigma x_s^2$, the sum of the squares of deviations within sets.

Step 4. For each set, compute d, which equals $(M_s - M_t)$.

Step 5. Square each d, and find $n\Sigma d^2$.


With these calculations completed (see Table 10.1), we have the values
we need for formulas (10.2) and (10.3). The $\Sigma x_s^2$ is 69.20, and the $n\Sigma d^2$
is 82.80. Dividing these by the appropriate degrees of freedom, we
obtain the variances. For this purpose, we set up Table 10.2. Listing
first the degrees of freedom and sums of squared deviations for “between
sets” and dividing, we obtain 27.60 as the variance contributed by the
d's. For the corresponding values for “within sets,” we find 4.325 as the
variance contributed by the $x_s$'s. The F ratio is 27.6/4.325, which equals
6.38. The between variance is over 6 times as great as the within variance.

TABLE 10.2.—THE TOTAL VARIANCE IN THE GALTON-BAR DATA SUBDIVIDED INTO TWO
COMPONENTS

Components        Degrees of freedom    Sums of squares    Variance
Between sets               3                 82.80           27.60
Within sets               16                 69.20            4.325
Total                     19                152.00

$$F = \frac{27.6}{4.325} = 6.38$$

The significance of an F ratio of this size is determined by reference to
Snedecor’s table (Table F, Appendix B). In using this table, we have to 
consider the two different degrees of freedom. For the larger variance,
with 3 degrees of freedom, we look for the column in Table F that is
headed (3). For the smaller variance, with 16 degrees of freedom, we 
look down the left-hand margin for the row headed (16). We must 
interpolate, since row (16) is not given, and thus we find that an F of 3.24 
is significant at the 5 per cent level and an F of 5.29 is significant at the 
1 per cent level; i.e., the odds are 5 to 95 that so large an F as 3.24 could
have occurred in a really homogeneous population, and they are 1 to 99 
that an F as large as 5.29 could have occurred likewise. Our obtained F is 
greater than that for the 1 per cent level and so is regarded as very signifi- 
cant. We conclude that there are significant differences among our sets. 
The test does not tell us where those differences are or whether all of them 
or only one is significant. To determine this would require further search. 
We only know from the F test that some significant law of variation 
between sets does exist. Further examination is needed to tell us what 
the causes of difference are and where they lie. 
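The five steps and the F ratio for the Galton-bar data can be reproduced with a short script. This is a minimal sketch in Python using the measurements of Table 10.1; the function `one_way_anova` and the variable names are illustrative, and the two critical values are simply those quoted in the text for 3 and 16 degrees of freedom.

```python
sets = {
    "I":   [114, 115, 111, 110, 112],
    "II":  [119, 120, 119, 116, 116],
    "III": [112, 116, 116, 115, 112],
    "IV":  [117, 117, 114, 112, 117],
}

def mean(values):
    return sum(values) / len(values)

def one_way_anova(groups):
    """Return (between variance, within variance, F) for a list of groups."""
    all_values = [x for g in groups for x in g]
    grand_mean = mean(all_values)
    k = len(groups)
    n_total = len(all_values)
    # Between sum of squares: each squared deviation of a set mean, weighted by set size.
    between_ss = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Within sum of squares: deviations of each observation from its own set mean.
    within_ss = sum((x - mean(g)) ** 2 for g in groups for x in g)
    between_var = between_ss / (k - 1)
    within_var = within_ss / (n_total - k)
    return between_var, within_var, between_var / within_var

between_var, within_var, f_ratio = one_way_anova(list(sets.values()))

F_05, F_01 = 3.24, 5.29   # values quoted in the text for 3 and 16 df
print(round(between_var, 2), round(within_var, 3), round(f_ratio, 2))
print("beyond 5% level:", f_ratio > F_05, "| beyond 1% level:", f_ratio > F_01)
```

The script only reports whether F exceeds the tabled values; as the text notes, it says nothing about which sets differ.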

Making t Tests Following an F Test.—Suppose that F turns out to be 
significant and we want to know just where the significance in the data 
exists. Particularly if F is significant between the 5 per cent and 1 per 
cent levels, it is not likely that all of the means are significantly different 
from all other means. Even when F is significant at the 1 per cent level, 
if one mean stands out as very different from the rest and the others differ 
very little, it is not likely that all mutual differences are significant. We 
are often inclined, then, to test at least some of the differences between 
pairs of means. The best procedure for this is to apply Fisher’s formula
(formula 9.39) for a t test in small samples. We assume the null hypothesis
for each pair in turn as we test it. We could save ourselves unnecessary
work by being judicious in starting the t tests. For example, if F is just
barely significant at the 5 per cent level, we might begin with the largest 
difference and proceed with other differences until one proves insignificant 
after which we forego further testing on the assumption that we have 
reached the probable limit. This would be safe, particularly if the sets 
have similar dispersions. If the F ratio is decidedly significant beyond 
the 1 per cent level, we might begin at the other end, with the smallest 
differences, and work up to the difference of a size that proved significant 
by t test, assuming that all differences as large or larger are also significant.

Some writers recommend that in making t tests after an F test we make
only one estimate of population variance for all pairs and that this estimate
be the within variance used in making the F test. This hardly seems
logical, for if the F test has already told us that we may not assume that the
sets all could have arisen by random sampling from the same population,
it is inconsistent to make one estimate as if it were one population. The
procedure described above seems preferable, though requiring more
effort.
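The pair-by-pair procedure can be sketched for the Galton-bar sets of Table 10.1. The sketch assumes the equal-n form of the small-sample t, namely the difference between means divided by the square root of (Σx₁² + Σx₂²)/(n(n - 1)), with the variance estimated from the two sets being compared rather than from all sets at once. Function names are illustrative, and each t is to be judged against a t table with 2(n - 1) = 8 degrees of freedom.

```python
from itertools import combinations
from math import sqrt

sets = {
    "I":   [114, 115, 111, 110, 112],
    "II":  [119, 120, 119, 116, 116],
    "III": [112, 116, 116, 115, 112],
    "IV":  [117, 117, 114, 112, 117],
}

def mean(values):
    return sum(values) / len(values)

def sum_sq(values):
    """Sum of squared deviations from the set's own mean."""
    m = mean(values)
    return sum((x - m) ** 2 for x in values)

def t_equal_n(a, b):
    """t for two independent sets of equal size n, with 2(n - 1) df."""
    n = len(a)
    standard_error = sqrt((sum_sq(a) + sum_sq(b)) / (n * (n - 1)))
    return (mean(a) - mean(b)) / standard_error

# Order the pairs with the largest difference between means first.
pairs = sorted(
    combinations(sets, 2),
    key=lambda pair: abs(mean(sets[pair[0]]) - mean(sets[pair[1]])),
    reverse=True,
)
for p, q in pairs:
    diff = mean(sets[p]) - mean(sets[q])
    print(f"{p:>3} vs {q:<3}  difference = {diff:+.1f}  t = {t_equal_n(sets[p], sets[q]):+.2f}")
```

Listing the pairs in order of difference makes it easy to stop testing at the point the text suggests, either from the top or from the bottom of the list.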

The Relation of t to F.—When we are reduced to two sets of observations,
as when we compare two means for significance, we can still make an F
test. The between variance will have associated with it only 1 degree of
freedom, and when this is the case it has been shown that F is equal to
$t^2$. For this particular situation, when $N_1 = N_2$, the sum of squares for
between-means variations is given by the formula

$$n\Sigma d^2 = \frac{n(M_1 - M_2)^2}{2} \tag{10.4}$$

(Sum of squares between means of two samples of equal size)


To illustrate, let us take the largest difference between means in Table 
10.1. The two means are 112.4 and 118.0, and their difference is 5.6. 
Applying formula (10.4) we find 78.4 for the between sum of squares. 
The within sum of squares is a combination of 17.2 and 14.0 from Table 
10.1. With 1 degree of freedom for the between sum of squares, the 
between variance is 78.4. With 8 degrees of freedom within the sets, the 
within variance is 31.2/8 = 3.9. The F ratio is 78.4/3.9 = 20.10, which 
is well beyond that required for significance at the 1 per cent level, the 
latter being 11.26. It is an established fact that with 1 degree of freedom
for the between variance, F equals $t^2$. If F is equal to $t^2$, t in this
problem must therefore be equal to 4.48.

Let us check this by computing t in the usual manner, using formula
(9.40). By that approach,

$$t = \frac{5.6}{\sqrt{\dfrac{17.2 + 14.0}{5(5 - 1)}}} = \frac{5.6}{1.25} = 4.48$$


It can be demonstrated mathematically that $t^2 = F$ under these conditions,
if one starts by squaring both sides of formula (9.40). Comparison of
Table D, last column (t values) and Table F, first column of F values, will
show that for the same number of degrees of freedom within the sets
$t^2 = F$.
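The identity between F and the square of t for two sets of equal size can be confirmed on Sets I and II of Table 10.1. In this sketch the between sum of squares follows formula (10.4), and t is computed in the equal-n form used in the worked example; the variable names are illustrative.

```python
from math import sqrt

set_1 = [114, 115, 111, 110, 112]   # Set I,  mean 112.4
set_2 = [119, 120, 119, 116, 116]   # Set II, mean 118.0
n = len(set_1)

def mean(values):
    return sum(values) / len(values)

def sum_sq(values):
    """Sum of squared deviations from the set's own mean."""
    m = mean(values)
    return sum((x - m) ** 2 for x in values)

m1, m2 = mean(set_1), mean(set_2)
within_ss = sum_sq(set_1) + sum_sq(set_2)      # 17.2 + 14.0

# F route: between variance (1 df) over within variance (2(n - 1) df).
between_ss = n * (m1 - m2) ** 2 / 2            # formula (10.4)
f_ratio = (between_ss / 1) / (within_ss / (2 * (n - 1)))

# t route: difference between means over its standard error.
t = abs(m1 - m2) / sqrt(within_ss / (n * (n - 1)))

print(round(f_ratio, 2), round(t, 2), round(t * t, 2))
```

Algebraically both routes reduce to n(n - 1)(M₁ - M₂)² divided by the combined within sum of squares, which is why the two figures agree.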
Computation of Variances from Original Measurements.—Just as we 
can compute standard deviations, and so variances, from original measure- 
ments without computing separate deviations from the means, (see formula 
5.11) so we can calculate the necessary constants for an analysis of 
variance. Such an approach requires us to square the original measure- 


| 


INTRODUCTION TO ANALYSIS OF VARIANCE 243 


ments. With good calculating machines available, this is no large order, 
but with only pencil and paper it amounts to considerable labor. 
Fortunately, by a process of coding, we can bring the numbers down to 
small size. From each of the three-place numbers in Table 10.1, we may 
substract the constant of 110, leaving the remainders shown in the first 
part of Table 10.3. The variances will not have been affected in the least 


TABLE 10.3—SOLUTION OF AN ANALYSIS OF VARIANCE FROM ORIGINAL MEASUREMENTS 
(Without Determining Deviations from Means) 
Measurements (Reduced) (X’) 


Set I Set II | Set III | Set IV 
4 9 2 7 
5 10 6 7 
1 9 6 gS 
0 6 5 2 
2 6 2 7 
(SH), 12 40 21 27 100 Xx’ 
5.0 M'i 
(2X), 144 1,600 441 729 2,914 =(2X')3, 
Squared Measurements (X"2) 
16 81 4 49 
25 100 36 49 
de f 81 36 16 
0 36 25 4 
4 36 4 49 
@x”), 46 334 105 167 652 SEX”): 


ee SSS EE a u 


by this particular coding process, for the new values, which we shall call 
X’, maintain the same distances from one another and from the means as 
they did before coding. The sums of squares we need for equations (10.2) 
and (10.3) are found by the following procedure. The sum of the between
variations squared is given by

$$n\Sigma d^2 = \frac{\Sigma(\Sigma X')_s^2}{n} - (\Sigma X')(M'_t) \tag{10.5}$$


The within sum of squares is given by 


$$\Sigma x_s^2 = \Sigma(\Sigma X'^2)_s - \frac{\Sigma(\Sigma X')_s^2}{n} \tag{10.6}$$

The total sum of squares is given by

$$\Sigma x_t^2 = \Sigma(\Sigma X'^2)_s - (\Sigma X')(M'_t) \tag{10.7}$$
The steps called for by these formulas are as follows: 


Step 1. Sum the coded measurements $X'$ for each set, to obtain $(\Sigma X')_s$ for
each set (see Table 10.3), and sum these values to obtain $\Sigma X'$.
Determine the mean $M'_t$ to two or more decimal places.

Step 2. Square the sums of the scores to obtain $(\Sigma X')_s^2$ for each set.
Accumulate these to find $\Sigma(\Sigma X')_s^2$.

Step 3. Square all the coded measurements to find the $X'^2$ values.

Step 4. Sum all the squared measurements to obtain $\Sigma(\Sigma X'^2)_s$.


Now, by formula (10.5),

$$n\Sigma d^2 = \frac{2914}{5} - 500 = 582.8 - 500 = 82.8$$

By formula (10.6),

$$\Sigma x_s^2 = 652 - \frac{2914}{5} = 652 - 582.8 = 69.2$$

And by formula (10.7),

$$\Sigma x_t^2 = 652 - (100)(5) = 652 - 500 = 152$$

A check for accuracy of computation is to see that $n\Sigma d^2 + \Sigma x_s^2 = \Sigma x_t^2$.
The check is satisfied here for 82.8 + 69.2 = 152.0.

A comparison of these values with those in Table 10.2 will show that
we have arrived at the very same sums of squares. From here on the
computation of variances and of F ratio is just the same as it was before. The
same formulas, (10.5) through (10.7), apply also to original measurements
without coding.
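Formulas (10.5) through (10.7) lend themselves to a direct computation from the coded scores. The sketch below codes the measurements of Table 10.1 by subtracting 110 and then follows the four steps; the names `CODE_CONSTANT`, `coded`, and so on are illustrative choices.

```python
CODE_CONSTANT = 110   # constant subtracted from every measurement

raw = [
    [114, 115, 111, 110, 112],   # Set I
    [119, 120, 119, 116, 116],   # Set II
    [112, 116, 116, 115, 112],   # Set III
    [117, 117, 114, 112, 117],   # Set IV
]
coded = [[x - CODE_CONSTANT for x in group] for group in raw]
n = len(coded[0])                                  # observations per set

# Step 1: sums of the sets, grand sum, and grand mean of the coded values.
set_sums = [sum(group) for group in coded]
grand_sum = sum(set_sums)
grand_mean = grand_sum / (n * len(coded))

# Step 2: squared set sums, accumulated.
sum_of_squared_sums = sum(s * s for s in set_sums)

# Steps 3 and 4: squared measurements, accumulated.
sum_of_squares = sum(x * x for group in coded for x in group)

between_ss = sum_of_squared_sums / n - grand_sum * grand_mean   # formula (10.5)
within_ss = sum_of_squares - sum_of_squared_sums / n            # formula (10.6)
total_ss = sum_of_squares - grand_sum * grand_mean              # formula (10.7)

print(round(between_ss, 1), round(within_ss, 1), round(total_ss, 1))
```

Setting `CODE_CONSTANT` to 0 runs the same formulas on the uncoded measurements and should give the same three sums of squares, since subtracting a constant leaves deviations unchanged.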


ANALYSIS IN A TWO-WAY CLASSIFICATION PROBLEM


In the preceding problem there was only a one-way classification. The 
sets of data were differentiated on the basis of only one experimental varia- 
tion, or were at least treated as if there was only one reason for fractionat- 
ing the data into sets. The principle of division into sets might have been 
time of day, degree of learning, of fatigue, or of illumination. The varia- 
ble, if it was a single one, in which there were differences from set to set 
might have been a quantitative one, as in the examples last mentioned, or
a qualitative one. An example of a qualitative basis of classification in a
maze-learning experiment would be several learning methods, such as ver- 
bal instruction, demonstration, “putting through,” and prevention of 
errors. A quantitative basis, also in maze learning, might be different 
lengths of time for visual inspection of the maze before starting to learn it. 

In a two-way classification, there are two distinct bases of classifica- 
tion. Two experimental conditions are allowed to vary from trial to trial. 
Usually there are several trials made under each combination of conditions. 
In the psychological laboratory, different air-field landing strips,
each with a different pattern of markings, may be viewed through a diffusion
screen to simulate vision through fog at different levels of opaqueness.
In an educational problem, four methods of teaching a certain
geometric concept may be applied by five different teachers, each one 
applying every one of the four methods. There would therefore be 20 
combinations of teacher and method, and let us suppose an equal number 
of pupils each giving a learning score under each combination. 

Tabulation of Data in a Two-way Classification Problem.—For an illus- 
tration of the procedure here, we will assume an experiment on the relation 
of scores on a certain psychomotor test to the size of a target at which the 
examinee must aim. In conducting the experiment it is convenient to use 
three testing machines simultaneously in order to reduce the testing time. 
It is known that there are individual differences between machines, in this 
test, to the extent that it would be risky to attach one target size to one 
machine only throughout the tests. Machine differences might make it 
appear that there were differences attributable to target differences or 
might by chance negate those differences. The four target sizes were therefore
combined with the three machines systematically, giving 12
target-machine combinations with 5 observed scores obtained with each
combination. The scores (which are entirely fictitious for the sake of a
good illustration) are tabulated in Table 10.4. This arrangement is typical
and convenient for the operations of analysis of variance. The sums
and means, as given, are also needed in the variance solution. 

The Sources of Variance in a Two-way Classification Problem.—We
could, if we chose, proceed to perform an analysis of variance based upon
the model of the one-way classification problem as already demonstrated.
That is, we could take the 12 sets as if they represented categories based
upon a single principle and test the 12 means collectively to see whether 
they could have arisen by random sampling from the same population. 
We shall see what kind of an answer could be obtained by this approach, 
later, but let us first see what is logically wrong with this kind of solution 
here. 




TABLE 10.4.—SCORES OF 60 STUDENTS EARNED ON THREE DIFFERENT MACHINES OF A
PSYCHOMOTOR TEST, EACH WITH THE TARGET SIZE VARIED IN FOUR STEPS

                             Machines            Sums for      Means for
Target size               1      2      3        target size   target size

     A                    6      4      4
                          4      1      2
                          2      5      2
                          6      2      1
                          2      3      1
     Σ                   20     15     10            45
     M                    4      3      2                           3

     B                    8      6      3
                          3      6      1
                          7      2      1
                          5      3      2
                          2      8      3
     Σ                   25     25     10            60
     M                    5      5      2                           4

     C                    7      9      6
                          6      4      4
                          9      8      3
                          8      4      8
                          5      5      4
     Σ                   35     30     25            90
     M                    7      6      5                           6

     D                    9      7      6
                          6      8      5
                          8      4      7
                          8      7      9
                          9      4      8
     Σ                   40     30     35           105
     M                    8      6      7                           7

Sums for machines       120    100     80           300
Means for machines        6      5      4                           5


Suppose we did carry through the solution proposed and found an F 
ratio that indicated significance beyond the 1 per cent level. We would 
not know whether this was due primarily or solely to the differences 
between targets or to the differences between machines, or to both possible 
sources. Suppose, on the contrary, the F ratio indicated no significant 
differences among sets. We would not be sure that one of the experimental
variations, perhaps target size, were not actually producing real
variations that were either covered over or counteracted by the effects
of the other experimental variation. We need some method that will 
segregate the variations associated with each of the experimental variables 
so that any significant differences at all will have a chance to emerge in the 
F test and so that we will know to which source to attribute any significant 
differences found. 

Interaction Variance.—The procedure about to be described makes
possible this kind of segregation of the sources of variations. As a result,
we can then determine whether differences among means owe their 
divergencies to target size or to machine differences, or to both. Not only 
that, when there are two possible sources of variations, there is also a 
possibility of what is called interaction variance. The phenomenon is well 
named. Interaction variations are those attributable not to either of two 
influences acting alone but to joint effects of the two acting together. If 
it turned out that the larger the target the larger the scores tended to be, 
that is one direct and isolable effect. If there are systematic machine 
differences so that among three there is a most “difficult” one (yields 
lower mean scores) and an easiest one (yields higher mean scores), that is 
another distinct effect. There may be effects of target size and machine 
over and above these. It is conceivable, but not very probable, that one 
machine, apart from its general difficulty, gains in difficulty by virtue of 
its having one size of target rather than others. It may be the coincidence 
of machine and target size that produces systematic variation in one 
direction from the general mean of scores. This is an example of inter- 
action variance. It might be more reasonably expected in combination 
of teacher and instruction method; of kind of task and method of attack 
by the learner; and of kind of reward when combined with a certain 
condition of motivation. It is also possible to determine whether there
is a significant amount of interaction variance present by making an F
test for it.

The Residual Variance.—There are three F tests to make, therefore, in
place of one. The remaining variance is known as the residual variance, 
that within sets. It supplies the basic or residual estimate of variance 
after the three sources of variations have been removed and it serves as 
the denominator for all three F tests. It is sometimes called an estimate 
of the error variance for the reason that it represents the influences of many
unknown and uncontrolled sources. A perfect experiment would pre- 
sumably control all contributing factors until within each set of data 
observed under a specified combination of conditions there would be no 
longer any variations; each observed value would be the same. Most
experiments are so imperfect that there is appreciable error variance.




Estimation of the Variances from Different Sources.—Two solutions 
will be described, one using deviations of observed values and of means of 
sets from various means, the other using original measurements and means. 
An attempt is made to summarize the operations in terms of formulas, as 
usual, but here the symbolizing of concepts becomes so involved that 
formulas may be more confusing than helpful. Some readers may find it 
easier to follow the examples as models rather than to apply the formulas. 
The system of symbols employed in the formulas is given in Table 10.5.
This table provides only three columns and three rows, but it can be 
extended in the directions shown to take care of any number of columns 
and rows. 


TABLE 10.5.—SYMBOLIC SCHEME FOR THE VALUES IN A TABULATION PREPARATORY TO
ANALYSIS OF VARIANCE IN A TWO-WAY CLASSIFICATION PROBLEM

                                 Column              Sums of rows   Means of rows
Row                         1        2        3         (ΣX_r)          (M_r)

 A        1               X_a1     X_a2     X_a3
          2
          3
          4
          5
          Σ              ΣX_a1    ΣX_a2    ΣX_a3         ΣX_a
          M               M_a1     M_a2     M_a3                         M_a

 B        1               X_b1     X_b2     X_b3
          2
          3
          4
          5
          Σ              ΣX_b1    ΣX_b2    ΣX_b3         ΣX_b
          M               M_b1     M_b2     M_b3                         M_b

 C        1               X_c1     X_c2     X_c3
          2
          3
          4
          5
          Σ              ΣX_c1    ΣX_c2    ΣX_c3         ΣX_c
          M               M_c1     M_c2     M_c3                         M_c

Sums of columns (ΣX_k)    ΣX_1     ΣX_2     ΣX_3         ΣX_t
Means of columns (M_k)     M_1      M_2      M_3                         M_t

Let X_rk = any one of the cell entries, X_a1, X_a2, . . . X_c3.

M_rk = any one of the set means, M_a1, M_a2, . . . M_c3.




The Solution Based upon Deviations.—In what follows, consistent with
the symbols in Table 10.5, a subscript k stands for a particular column
(we might have used c for column, but there would be danger of confusing
this with a particular row, row C), and r stands for a particular row.
There are only three columns, 1, 2, and 3, in the psychomotor test problem,
and four rows, A, B, C, and D. The symbol $X_{rk}$ stands for any one
observation in row r and column k and $M_{rk}$ stands for a mean of the five
observations in a cell described as being in row r and column k. In the
following, n stands for the number of observations within each set; in the
illustrative problem n = 5. The number of rows is symbolized by r and
the number of columns by k. The subscript t refers to the total distribution,
all sets combined. Thus, $M_t$ stands for the mean of the composite,
and $x_t$ stands for a deviation of any X from $M_t$.

The total sum of squares is given by the equation 


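$$\Sigma x_t^2 = \Sigma(X_{rk} - M_t)^2 \tag{10.8}$$

Applied to the data of Table 10.4,

$$\begin{aligned}
\Sigma x_t^2 &= (6 - 5)^2 + (4 - 5)^2 + (4 - 5)^2 && \text{(from first row of Table 10.4)} \\
&\quad + \cdots + (9 - 5)^2 + (4 - 5)^2 + (8 - 5)^2 && \text{(from last row of Table 10.4)} \\
&= 1^2 + (-1)^2 + (-1)^2 + \cdots + 4^2 + (-1)^2 + 3^2 \\
&= 374 \quad \text{(total sum of squares)}
\end{aligned}$$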


The sum of squares between rows is given by the equation 
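$$\Sigma d_r^2 = nk[\Sigma(M_r - M_t)^2] \tag{10.9}$$

Applied to the same data,

$$\begin{aligned}
\Sigma d_r^2 &= 5 \times 3[(3 - 5)^2 + (4 - 5)^2 + (6 - 5)^2 + (7 - 5)^2] \\
&= 15[(-2)^2 + (-1)^2 + 1^2 + 2^2] \\
&= 15 \times 10 \\
&= 150 \quad \text{(the sum of squares between rows)}
\end{aligned}$$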


The sum of squares between columns is given by the equation

$$\Sigma d_k^2 = nr[\Sigma(M_k - M_t)^2] \tag{10.10}$$

Applied to the data of Table 10.4,

$$\begin{aligned}
\Sigma d_k^2 &= 5 \times 4[(6 - 5)^2 + (5 - 5)^2 + (4 - 5)^2] \\
&= 20[1^2 + 0^2 + (-1)^2] \\
&= 20 \times 2 \\
&= 40 \quad \text{(sum of squares between columns)}
\end{aligned}$$


The interaction variance can be estimated in several ways. Perhaps 
the most common way is to derive it from the sum of squares between all 
sets, eliminating the sums of squares between columns and between rows. 
We already know the last two sums of squares. We proceed next to 
compute the sum of squares between sets. The formula is similar to the 
numerator of formula (10.2) but with different notation to fit the new 
system. 

$$\Sigma d_{rk}^2 = n[\Sigma(M_{rk} - M_t)^2] \tag{10.11}$$

The symbol $d_{rk}^2$ refers to a squared difference between any set mean and
the total mean $M_t$. The subscript rk implies that all rows and all columns
are involved. Applied to the illustrative data,

$$\begin{aligned}
\Sigma d_{rk}^2 &= 5[(4 - 5)^2 + (3 - 5)^2 + (2 - 5)^2 && \text{(from first row of means)} \\
&\quad + \cdots + (8 - 5)^2 + (6 - 5)^2 + (7 - 5)^2] && \text{(from last row of means)} \\
&= 5[(-1)^2 + (-2)^2 + (-3)^2 + \cdots + 3^2 + 1^2 + 2^2] \\
&= 5 \times 42 \\
&= 210 \quad \text{(sum of squares between means of sets)}
\end{aligned}$$


If we remove from the entire sum of squares for the 12 set means the sum
of squares attributable to columns and to rows, we have left the interaction
sum of squares. By formula,

$$\Sigma d_{r \times k}^2 = \Sigma d_{rk}^2 - \Sigma d_k^2 - \Sigma d_r^2 \tag{10.12}$$

in which $\Sigma d_{r \times k}^2$ (the subscript reads r times k, for reasons that will be
explained) stands for the interaction sum of squares. For the illustrative
problem,

$$\Sigma d_{r \times k}^2 = 210 - 40 - 150 = 20 \quad \text{(interaction sum of squares)}$$


Another, more direct, way of deriving interaction sums of squares utilizes
the formula

$$\Sigma d_{r \times k}^2 = n[\Sigma(M_{rk} - M_r - M_k + M_t)^2] \tag{10.13}$$

in which $M_k$ is the mean of the column in which each particular $M_{rk}$ appears
and $M_r$ the mean of its row. For the illustrative problem,

$$\begin{aligned}
\Sigma d_{r \times k}^2 &= 5[(4 - 3 - 6 + 5)^2 + (3 - 3 - 5 + 5)^2 && \text{(from first row of means)} \\
&\quad + \cdots + (6 - 7 - 5 + 5)^2 + (7 - 7 - 4 + 5)^2] && \text{(from last row of means)} \\
&= 5[0^2 + 0^2 + \cdots + (-1)^2 + 1^2] \\
&= 5 \times 4 \\
&= 20 \quad \text{(interaction sum of squares; alternative solution)}
\end{aligned}$$
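Formula (10.13) can be followed cell by cell from the twelve set means of Table 10.4. In the sketch below each cell mean has its row mean and column mean removed and the grand mean restored; what remains is the part of the cell mean that neither rows nor columns explain. The dictionary layout and names are illustrative.

```python
n = 5   # observations per set

# Set means from Table 10.4: rows are target sizes, columns are machines 1 to 3.
cell_means = {
    "A": [4, 3, 2],
    "B": [5, 5, 2],
    "C": [7, 6, 5],
    "D": [8, 6, 7],
}
k = 3

row_means = {row: sum(m) / k for row, m in cell_means.items()}
col_means = [sum(m[j] for m in cell_means.values()) / len(cell_means) for j in range(k)]
grand_mean = sum(row_means.values()) / len(row_means)

# Residual of each cell mean after removing row and column effects.
residuals = {
    row: [m[j] - row_means[row] - col_means[j] + grand_mean for j in range(k)]
    for row, m in cell_means.items()
}
interaction_ss = n * sum(e ** 2 for row in residuals.values() for e in row)

for row, values in residuals.items():
    print(row, values)
print("interaction sum of squares:", interaction_ss)
```

With these data only four of the twelve residuals differ from zero, all of them in rows B and D for machines 2 and 3.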


The sum of squares within sets is computed by the formula 
$$\Sigma x_s^2 = \Sigma(X_{rk} - M_{rk})^2 \tag{10.14}$$

This formula, with new symbols, requires the same operations as formula
(10.3) given in connection with the single-classification problem. Applied
to the psychomotor problem,

$$\begin{aligned}
\Sigma x_s^2 &= (6 - 4)^2 + (4 - 4)^2 + (2 - 4)^2 + (6 - 4)^2 + (2 - 4)^2 && \text{(from set A1)} \\
&\quad + \cdots + (6 - 7)^2 + (5 - 7)^2 + (7 - 7)^2 + (9 - 7)^2 + (8 - 7)^2 && \text{(from set D3)} \\
&= 164 \quad \text{(sum of squares within sets)}
\end{aligned}$$


We can now check the solution of this by deducting all previously com- 
puted sums of squares from the total sum of squares, and we have 


374 — 40 — 150 — 20 = 164 


We could compute $\Sigma x_s^2$ by this elimination process without going through
the arduous arithmetic involved in using (10.14), but for checking pur- 
poses it is very desirable to derive all the component sums of squares sepa- 
rately and then check the results. 
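The whole deviation solution, including the final check, can be sketched from the raw scores of Table 10.4. Here `scores[row][j]` holds the five scores for one target size on machine j + 1; this layout and the variable names are illustrative assumptions, and the interaction term is obtained by subtraction as in formula (10.12).

```python
scores = {
    "A": [[6, 4, 2, 6, 2], [4, 1, 5, 2, 3], [4, 2, 2, 1, 1]],
    "B": [[8, 3, 7, 5, 2], [6, 6, 2, 3, 8], [3, 1, 1, 2, 3]],
    "C": [[7, 6, 9, 8, 5], [9, 4, 8, 4, 5], [6, 4, 3, 8, 4]],
    "D": [[9, 6, 8, 8, 9], [7, 8, 4, 7, 4], [6, 5, 7, 9, 8]],
}

def mean(values):
    return sum(values) / len(values)

rows = list(scores)
r, k, n = len(rows), 3, 5
all_scores = [x for row in rows for cell in scores[row] for x in cell]
m_t = mean(all_scores)

row_means = {row: mean([x for cell in scores[row] for x in cell]) for row in rows}
col_means = [mean([x for row in rows for x in scores[row][j]]) for j in range(k)]

total_ss = sum((x - m_t) ** 2 for x in all_scores)                                # (10.8)
rows_ss = n * k * sum((m - m_t) ** 2 for m in row_means.values())                 # (10.9)
cols_ss = n * r * sum((m - m_t) ** 2 for m in col_means)                          # (10.10)
cells_ss = n * sum((mean(cell) - m_t) ** 2 for row in rows for cell in scores[row])  # (10.11)
interaction_ss = cells_ss - rows_ss - cols_ss                                     # (10.12)
within_ss = sum(
    (x - mean(cell)) ** 2 for row in rows for cell in scores[row] for x in cell   # (10.14)
)

print(total_ss, rows_ss, cols_ss, cells_ss, interaction_ss, within_ss)
print("check:", total_ss - rows_ss - cols_ss - interaction_ss == within_ss)
```

Because the within sum of squares is computed directly rather than by elimination, agreement in the last line serves as the arithmetic check recommended in the text.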

Degrees of Freedom.—Before taking the next important step of esti- 
mating population variance from these different sources, we need, as usual, 
the degrees of freedom. Starting with the largest source, the total sum of 
squares, we have, as usual, (N — 1) df, or 59. This figure is to be subdi- 
vided among the contributing components. The sum of squares among 
the means of sets should have allotted to it the number of sets minus 1, or 
$12 - 1 = 11$ df. These 11, in turn, are to be allotted to three sources.
Rows have the number of row observations (row means) minus 1, or
$4 - 1 = 3$. Columns have, by analogy, $3 - 1 = 2$. This leaves 6 out
of the 11 for interaction. These 6 degrees are $3 \times 2$, the product of the df for
rows and columns, each source taken separately. This is consistent with
the idea of interaction itself, whose contributions to variations may be
regarded as the products of two sources. This is why we use the sub- 
script r X k when referring to interaction. Having taken care of the 
special sources of variations, the remainder, or 59 — 11, gives us the df left 
for within-sets sums of squares. This number of df may also be deter- 
mined directly from a summation of df within sets. Since there are 12 
sets and each contains 4 df, we have 12 X 4 = 48 df for the residual 
variance. 

In terms of symbolic descriptions, the degrees of freedom may be given 
as follows: 


Source               Degrees of Freedom
Between rows         r - 1
Between columns      k - 1
Interaction          (r - 1)(k - 1)
Within sets          N - rk = rk(n - 1)
Total                N - 1
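The allotment of degrees of freedom can be checked mechanically. The following is only a minimal sketch in Python; the function name `two_way_df` is hypothetical and is introduced here for illustration, and equal numbers of observations (n) in every set are assumed, as in the psychomotor problem.

```python
def two_way_df(r, k, n):
    """Degrees of freedom for r rows, k columns, and n observations per set."""
    N = r * k * n
    df = {
        "rows": r - 1,
        "columns": k - 1,
        "interaction": (r - 1) * (k - 1),
        "within": r * k * (n - 1),   # same as N - rk
        "total": N - 1,
    }
    # The four component sources must use up the total exactly.
    parts = df["rows"] + df["columns"] + df["interaction"] + df["within"]
    assert parts == df["total"]
    return df


# Psychomotor problem: 4 target sizes, 3 machines, 5 examinees per set.
df = two_way_df(r=4, k=3, n=5)
print(df)
```

With r = 4, k = 3, and n = 5 the sketch returns 3, 2, 6, 48, and 59, the values derived in the text.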


The F Ratios.—We are now ready to estimate the variances and to
compute the F ratios. These are systematically arranged in Table 10.6.
There are four different estimates of population variance: 50.0, 20.0, 3.33,
and 3.42. We compare each of the first three, since they represent possible
special contributions resulting from varied experimental conditions, with
the fourth. The fourth presumably represents variations of the phenomenon
measured, freed from possible influences of the experimental
variations. Do the first three differ significantly from the fourth?

TABLE 10.6.—SOURCES OF VARIANCE IN THE PSYCHOMOTOR-TEST DATA ANALYSIS AND F RATIOS

Source                  Sum of squares   Degrees of freedom   Estimate of variance
Target size (T)              150                 3                   50.0
Machine (M)                   40                 2                   20.0
Interaction (T × M)           20                 6                    3.33
Within sets                  164                48                    3.42
Total                        374                59

                                                      Required F
                                                 5% level    1% level
F for targets     = 50.0/3.42 = 14.62              2.80        4.22
F for machines    = 20.0/3.42 =  5.85              3.19        5.08
F for interaction = 3.33/3.42 =  0.97              2.30        3.20
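The arithmetic behind the variance estimates and F ratios can be reproduced in a few lines. This is a sketch only: the dictionary names are hypothetical, and the sums of squares and degrees of freedom are simply copied from the text. Because the quotients are not rounded before division, the ratios may differ from the tabled ones in the second decimal place.

```python
# Sums of squares and degrees of freedom as given in the text.
sums_of_squares = {"targets": 150, "machines": 40, "interaction": 20, "within": 164}
degrees_of_freedom = {"targets": 3, "machines": 2, "interaction": 6, "within": 48}

# Each estimate of population variance is a sum of squares over its df.
mean_squares = {
    source: sums_of_squares[source] / degrees_of_freedom[source]
    for source in sums_of_squares
}

# Every F ratio has the within-sets estimate as its denominator.
f_ratios = {
    source: mean_squares[source] / mean_squares["within"]
    for source in ("targets", "machines", "interaction")
}

for source, f in f_ratios.items():
    print(f"F for {source}: {f:.2f}")
```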

The F ratios are given below the table, together with the F’s required for 
significance at the 5 per cent and 1 per cent levels as determined from 
Snedecor’s table (Table F). From these results it appears that variations 
in target size definitely carry with them systematic variations in test score. 
There is a law of relationship fairly well established between target size and 
difficulty of the test. The F ratio for machines is significant beyond the 1 
per cent level, leaving us with considerable confidence that the machine 
differences, as such, have a real bearing upon the difficulty of the task. 
This conclusion is in some doubt because of possible failure of experi- 
mental design, however. Since the examinees were different groups for 
the three test machines, we cannot be sure that some real differences of 
ability have not combined with minor machine differences to give an appar- 
ently significant machine difference. A matching of examinees for 
machines might have improved the precision of the experiment. This 
would have entailed modification in the analysis-of-variance operations. 
The F for interaction proved to be rather decidedly insignificant. There 
is no reason to believe that changing target size has different effects 
depending upon the machine with which it is associated. 

Removal of Sources of Variation—It may illuminate the concepts of 
different kinds of variance and the way in which they contribute to total 
variance in the sample if we separate them in another way. Table 10.7A
shows the 12 means of sets for the psychomotor-test data. Variations
among them are due to the three possible sources: target differences,
machine differences, and the interaction of the two. The possible effects
of target size are most apparent in the means of the rows: 3, 4, 6, and 7.
The possible effects of machine differences are most apparent in the means
of the columns: 6, 5, and 4. The possible interaction variance is obscured.
It possibly contributes both to the means of rows and of columns; we do 
not know. Let us strip away first the variations attributable to machines 
and then that attributable to targets and see what variations are left. 

The mean of all observations is 5. Any deviation of a column mean 
from 5 indicates a constant error for a particular machine. Machine 1 
gave a mean of 6, indicating that machine 1 had a constant error of +1. 
Machine 2 apparently had no constant error, while machine 3 had a 
constant error of —1. If we deduct from each cell or set mean in column 
(1) the amount of constant error involved for machine 1, we would presumably
remove from the means in column (1) the influence of machine 1
as a source of variation. We can do likewise for column (3), deducting the 
constant error of —1, which is equivalent to adding +1 to each mean. 


TABLE 10.7.—ANALYSIS OF THE BETWEEN-SETS SUMS OF SQUARES IN THE PSYCHOMOTOR-
TEST DATA INTO THREE COMPONENTS BY SUCCESSIVE REMOVAL OF CONTRIBUTING
SOURCES OF VARIATION

                 Column
Row       (1)     (2)     (3)       Σ       M

A. Original Matrix of Means of Sets

A          4       3       2        9       3
B          5       5       2       12       4
C          7       6       5       18       6
D          8       6       7       21       7
Σ         24      20      16       60
M          6       5       4                5

B. With Variations Associated with Machines Removed

A          3       3       3        9       3
B          4       5       3       12       4
C          6       6       6       18       6
D          7       6       8       21       7
Σ         20      20      20       60
M          5       5       5                5

C. With Variations Associated with Target Size Also Removed; Only Interaction
Variance Remaining

A          5       5       5       15       5
B          5       6       4       15       5
C          5       5       5       15       5
D          5       4       6       15       5
Σ         20      20      20       60
M          5       5       5                5


We need do nothing for column (2). The results of these operations are 
shown in Table 10.7B. The means of the columns are now all 5, to agree 
with the composite mean, $M_t$. The means of the rows have been unaffected
(they are still 3, 4, 6, and 7) because the changes in one column are 
compensated for by changes in reverse direction in another column. The 
cell values in Table 10.7B still have in them the variance attributable to 
targets and to interaction variance. 

Next we remove the target variance. The constant errors for rows are 
-2, -1, +1, and +2, respectively. Deducting these from the values in their
respective rows in Table 10.7B, we have the results in subtable C. The
means of the rows as well as of the columns are now all 5. But within four
cells there are departures from 5. These are possibly the interaction
deviations, depending upon whether or not they prove to be significant.
Machine 2 would seem to favor high scores when coupled with target B 
and to favor low scores when coupled with target D. Machine 3 has a 
reverse tendency. But the F test showed these deviations to be insig- 
nificant. There seem to be no good logical reasons to expect any syste- 
matic coupling of target and machine. In other problems there may be 
significant interaction effects, but one would expect them to be systematic 
when experimental variables are quantitative in character. 
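The stripping-away procedure lends itself to a short computational sketch. The Python below is illustrative only (the variable names are hypothetical); it starts from the 12 means of sets, removes the column (machine) constant errors and then the row (target) constant errors, and converts the leftover deviations into a sum of squares by weighting each cell with its n of 5.

```python
# Means of sets: rows are targets A-D, columns are machines 1-3.
means = [
    [4, 3, 2],
    [5, 5, 2],
    [7, 6, 5],
    [8, 6, 7],
]
n = 5                                   # observations per set
rows, cols = len(means), len(means[0])
grand_mean = sum(map(sum, means)) / (rows * cols)

# Step 1: constant error of each machine = column mean minus grand mean.
col_errors = [
    sum(means[i][j] for i in range(rows)) / rows - grand_mean
    for j in range(cols)
]
step_b = [[means[i][j] - col_errors[j] for j in range(cols)] for i in range(rows)]

# Step 2: constant error of each target = row mean minus grand mean.
row_errors = [sum(row) / cols - grand_mean for row in step_b]
step_c = [[step_b[i][j] - row_errors[i] for j in range(cols)] for i in range(rows)]

# Sums of squares attributable to each source.
ss_machines = n * rows * sum(e ** 2 for e in col_errors)
ss_targets = n * cols * sum(e ** 2 for e in row_errors)
ss_interaction = n * sum((v - grand_mean) ** 2 for row in step_c for v in row)

print(col_errors, row_errors)
print(ss_targets, ss_machines, ss_interaction)
```

The three sums of squares obtained in this way are 150, 40, and 20, which together account for the 210 between sets.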

The finding of insignificant deviations among the means suggests 
several things. One is that these variations are random sampling effects 
that really belong to the within variance but were not pulled out with it. 
There is good reason, therefore, for combining this source of variance with 
that from within sets. The sum of squares for this was 20. Combined
with that from within sets, we have a total of 184. With 48 degrees of
freedom, we have a within variance that is raised from 3.42 to 3.83. (If
the 6 degrees of freedom for interaction are pooled as well, the divisor is
54 and the estimate is 3.41.) In neither case is the change enough to make
any material difference in the F ratios for the target or machine sources.
Our inferences about those sources being significant remain unchanged.
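How little the F ratios move can be seen by recomputing them with the combined sum of squares. The sketch below is illustrative; the variable names are hypothetical. It uses the divisor of 48 stated in the text and, as a labeled alternative that is an assumption rather than the text's procedure, a divisor of 54 in which the interaction df are pooled as well.

```python
# Figures taken from Table 10.6.
ss_within, df_within = 164, 48
ss_interaction, df_interaction = 20, 6
ms_targets, ms_machines = 50.0, 20.0

pooled_ss = ss_within + ss_interaction                     # 184

# Divisor used in the text: the original 48 df.
ms_text = pooled_ss / df_within
# Alternative (assumption): pool the degrees of freedom too.
ms_pooled_df = pooled_ss / (df_within + df_interaction)

for label, ms in (("48 df", ms_text), ("54 df", ms_pooled_df)):
    print(label, round(ms, 2), round(ms_targets / ms, 2), round(ms_machines / ms, 2))
```

Under either divisor the F for targets stays above 13 and the F for machines stays above 5.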

Second Solution—From Original Measurements.—Next will be given 
the formulas and their application for the solution of means of squares 
without computing deviations. With small integral numbers to start 
with, or numbers coded to such magnitude, these procedures are often 
more convenient than those utilizing deviations. The first solution, with 
deviations, is more meaningful to the beginner. In the following exposi-
tion, each formula will be stated and then immediately applied to the
psychomotor-test data.


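Total sum of squares:

$$
\begin{aligned}
\Sigma x_t^2 &= \Sigma X_{ij}^2 - \frac{(\Sigma X_{ij})^2}{N} && (10.15) \\
&= (6^2 + 4^2 + 4^2 && \text{(from first row of Table 10.4)} \\
&\quad + \cdots \\
&\quad + 9^2 + 4^2 + 8^2) && \text{(from last row of observations in Table 10.4)} \\
&\quad - \frac{(300)^2}{60} \\
&= 1874 - 1500 \\
&= 374 && \text{(total sum of squares)}
\end{aligned}
$$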




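Sum of squares between sets:

$$
\begin{aligned}
\Sigma d_s^2 &= \frac{\Sigma (\Sigma X_s)^2}{n} - \frac{(\Sigma X_{ij})^2}{N} && (10.16) \\
&= \tfrac{1}{5}\,[(20^2 + 15^2 + 10^2 && \text{(from first $\Sigma$ row of Table 10.4)} \\
&\quad + \cdots \\
&\quad + 40^2 + 30^2 + 35^2)] && \text{(from last $\Sigma$ row of Table 10.4)} \\
&\quad - \frac{(300)^2}{60} \\
&= 1710 - 1500 \\
&= 210 && \text{(sum of squares between sets)}
\end{aligned}
$$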


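Sum of squares between rows:

$$
\begin{aligned}
\Sigma d_r^2 &= \frac{\Sigma (\Sigma X_r)^2}{kn} - \frac{(\Sigma X_{ij})^2}{N} && (10.17) \\
&= [\tfrac{1}{15}(45^2 + 60^2 + 90^2 + 105^2)] - 1500 \\
&= 1650 - 1500 \\
&= 150 && \text{(sum of squares between rows)}
\end{aligned}
$$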


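Sum of squares between columns:

$$
\begin{aligned}
\Sigma d_k^2 &= \frac{\Sigma (\Sigma X_k)^2}{rn} - \frac{(\Sigma X_{ij})^2}{N} && (10.18) \\
&= [\tfrac{1}{20}(120^2 + 100^2 + 80^2)] - 1500 \\
&= 1540 - 1500 \\
&= 40 && \text{(sum of squares between columns)}
\end{aligned}
$$
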
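Sum of squares for interaction:

$$
\begin{aligned}
\Sigma d_{r \times k}^2 &= \Sigma d_s^2 - \Sigma d_r^2 - \Sigma d_k^2 && (10.19) \\
&= 210 - 150 - 40 \\
&= 20 && \text{(sum of squares for interaction)}
\end{aligned}
$$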


Sum of squares within sets:

$$
\begin{aligned}
\Sigma x_w^2 &= \Sigma x_t^2 - \Sigma d_s^2 && (10.20) \\
&= 374 - 210 \\
&= 164 && \text{(sum of squares within sets)}
\end{aligned}
$$


It will be noted that the correction factor $(\Sigma X_{ij})^2/N$, which appears in
most of these equations, is identical and once computed will do thereafter. 

The sums of squares by this method are seen to be identical with those 
found by the preceding method. The estimation of the population 
variance from each source and the application of the F test are the same as 
before (see Table 10.6). 
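The second solution can be sketched compactly in Python. The example is illustrative only and rests on stated assumptions: the totals for sets in the first and last rows are those quoted in the text, the totals for the two middle rows are obtained here by multiplying the means of sets in Table 10.7A by n = 5, and the sum of squared original measurements (1874) is taken from the text rather than recomputed, since the individual measurements are not repeated in this section. All variable names are hypothetical.

```python
# Totals of the 12 sets (rows: targets A-D; columns: machines 1-3).
set_totals = [
    [20, 15, 10],
    [25, 25, 10],   # assumed: means 5, 5, 2 times n = 5
    [35, 30, 25],   # assumed: means 7, 6, 5 times n = 5
    [40, 30, 35],
]
n = 5
sum_of_squared_scores = 1874            # from the text
r, k = len(set_totals), len(set_totals[0])
N = r * k * n

grand_total = sum(map(sum, set_totals))
correction = grand_total ** 2 / N        # computed once, reused throughout

row_totals = [sum(row) for row in set_totals]
col_totals = [sum(set_totals[i][j] for i in range(r)) for j in range(k)]

ss_total = sum_of_squared_scores - correction
ss_sets = sum(t ** 2 for row in set_totals for t in row) / n - correction
ss_rows = sum(t ** 2 for t in row_totals) / (k * n) - correction
ss_cols = sum(t ** 2 for t in col_totals) / (r * n) - correction
ss_interaction = ss_sets - ss_rows - ss_cols
ss_within = ss_total - ss_sets

print(ss_total, ss_sets, ss_rows, ss_cols, ss_interaction, ss_within)
```

The values agree with those found from deviations: 374, 210, 150, 40, 20, and 164.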




AN EVALUATION OF ANALYSIS OF VARIANCE


Assumptions to Be Satisfied in Applying Analysis of Variance.—Like 
most statistics, those involved in analysis of variance have been derived 
on the basis of mathematical reasoning, and that reasoning starts with
certain assumptions. If those assumptions are satisfied within certain 
limits of tolerance, the results in terms of F ratios may be interpreted as 
described in this chapter. If those assumptions are not sufficiently 
approximated, there is considerable risk that the conclusions may be 
faulty. 

There are four assumptions often specified. They are:¹

1. The contributions to variance in the total sample must be additive. 
This is implied in the equation (10.1), in which the total sum of squares 
is assumed to be a summation of sums of squares from within sets plus 
sums of squares from between sets. The same summative idea is illus- 
trated in Table 10.7 in which we stripped off one by one the three sources 
of variance. The additive nature of variations squared is dependent to
some extent upon other assumptions to follow.

2. The observations within sets must be mutually independent. The
“laws of chance” must be allowed to operate in an unrestricted way. The
occurrence of a certain deviation in one observation must be in no way
dependent upon any other deviation. This is, of course, a good descrip-
tion of random sampling. The random sampling occurs within sets.
The intentional variation of experimental conditions may produce sys-
tematic variations between sets. Whether or not systematic variations
do occur is the thing being tested.

3. The variances within experimentally homogeneous sets must be 
approximately equal. By “experimentally homogeneous” is meant 
observations under one specified set of experimental conditions. The 
“within-set” variance is, of course, the denominator of every F ratio.
It therefore carries a heavy burden, especially if there is more than one F
to be computed for the same data. This variance is used as a single esti-
mate of the population variance, and all contributors to it should tell the
same story. If any two sets furnish widely divergent ideas of the population
variance, the latter is not very accurately estimated. If there is
serious doubt about the variances indicated by any two sets, we can, and 
should, make an F test for the difference between those two variances. 
If F is so high as to cause rejection of the null hypothesis, we should not 
use both set results as sources of a within sum of squares. 

¹ Cochran, W. G. Some consequences when the assumptions for the analysis of
variance are not satisfied. Biometrics, 1947, 3, 22-38.


4. The variations within experimentally homogeneous sets should be
normally distributed. Most subsamples will be so small that it will be
out of the question to test the distribution for normality by the chi-square
test. Decision as to normality or lack of it will have to come from other
sources. Knowledge of the form of population distribution would be
sufficient basis, if sampling is random. Skewness, if marked, is a serious
source of violation of the validity of an F test.

If we follow the practice of free and random sampling within sets and 
if we use a metric scale on which there is lack of restriction and on which 
units are equal, we can feel assurance that the F test will not be invalidated.
It must be remembered, however, that conditions of sampling are never
ideal. F tests are therefore usually only approximate. Under somewhat
doubtful circumstances, an F that proves to be significant at the 5 per cent
level may be actually significant anywhere from the 4 to the 7 per cent
level; one significant at the 1 per cent level might actually be significant
between the 0.5 per cent and 2 per cent level.¹ If anything, the signifi-
cance is likely to be lower than that indicated by the result, when assump-
tions are not well satisfied.

¹ Cochran, op. cit.

General Uses and Limitations of Analysis of Variance.—There is insuf-
ficient space here to do more than to give this introduction to the analysis-
of-variance methods. There are many and varied applications of these
basic cases (the separation of variance among a few sets of data into the
“within” and “between” variances) both in psychology and in education.

Sets of data may be divided according to chronological-age groups,
mental-age groups, sex-difference groups, etc. In psychophysical experi-
ments, judgments of phenomena may be made under various conditions:
fatigue or rest or after different degrees of practice. In education, the
testing of different teaching methods can be done in different schools, in

often vary in several directions at the same time. This complicates the
analysis-of-variance solutions in various ways. There are also problems
in which the sets of data are not independently observed, as was assumed
in the present chapter. There is a technique for the analysis of covariance
as well as of variance. Covariance and correlation are closely related, as
will be shown in later chapters. For further descriptions of how to adapt
the method to various kinds of experimental problems, the reader is
referred to books that treat the subject at much greater length.¹

By way of hasty evaluation of the method, it may be said that analysis 
of variance undoubtedly provides a powerful tool of working through data 
in order to see where the significant lines of cleavage lie and thus furnishes 
some basis for establishing the presence of laws. It can also be said that 
the method requires supplementary procedures for a more detailed study 
of data and that there are other statistical methods—for example, corre- 
lation procedures—that enable us to accomplish the same purpose in many 
instances.

Not the least of its merits is the rather strict set of requirements it 
presupposes in the designing of experiments. Experimental designs have 
generally been observed, particularly in psychophysical research, for a 
long time. But they have generally not been so consciously considered 
or so well planned as to yield the maximum number of dependable
answers as is true when the experimenter has kept clearly in mind the
corresponding statistical tests that go with those designs. The subject
of experimental design is well treated in the book by Lindquist already
cited, so far as certain educational problems are concerned. Discussions
of designs for psychological and other experiments may be found
elsewhere.²


Exercises 


Assume that Data 10A represent measurements of the lower threshold for pitch of 
tones under the following conditions. The observer was the same throughout. Each 
“trial” was composed of 100 observations, of which four were selected at random. 
The 400 observations were made during the same half day, with only short rest pauses 
after each 25 and with 10 minutes between trials. 

1. Using the four sets of observations made on the first day, apply an F test to deter- 
mine whether there were systematic changes in threshold level from trial to trial. 
Estimate variances using deviations from the means. Interpret your results.

2. Make a similar test of the observations made on the second day, estimating vari-
ances from the original measurements. If F proves to be significant, make t tests to
determine where the genuine changes occur.


¹ Lindquist, E. F. Statistical analysis in educational research. New York: Houghton-
Mifflin, 1940. Snedecor, G. W. Statistical methods. Ames, Iowa: Collegiate, 1937.
Johnson, P. O. Statistical methods in research. New York: Prentice-Hall, 1949.
McNemar, Q. Psychological statistics. New York: Wiley, 1949.

² See Baxter, B. Problems in the planning of psychological experiments. Amer. J.
Psychol., 1941, 54, 270-280. Also, Fisher, R. A. The design of experiments. Edinburgh:
Oliver & Boyd, 1935.


DATA 10A.—DATA IN A TWO-WAY CLASSIFICATION

                  Trial
Day       I      II     III     IV

 1       24      19      21     24
         26      12      16     18
         21      17      17     22
         17      20      18     18

 2       18      15      16     15
         19      15      19     19
         18      14      17     16
         17      12      14     18


3. Treat the entire table of data as a two-way classification problem. Make F 
tests to determine the significance of the three special sources of variance (between 
trials, between days, and interaction of trials and days). Interpret your results. 

4. Take out each source of variance in Data 10A step by step, as was demonstrated 
in Table 10.7. 


CHAPTER 11 
TESTING HYPOTHESES 


We have already emphasized the point that experiment and statistical 
method go hand in hand. The one supplements the other. The experi- 
ment directs our observations and yields data. By means of statistical 
methods, we can summarize those data, interpret them, and determine 
their reliability. 

The best experiments are those that are set up to test the truth or 
falsity of some hypothesis. From previous experience, we believe a cer- 
tain thing to be true, but it requires a crucial test to enable us to accept 
or to reject the hypothesis. If the result comes out one way, the hypoth- 
esis is probably correct; if it comes out another way, the hypothesis is 
probably wrong. The term “probably” is inserted because there is no 
such thing in science as absolute certainty. We are only more or less 
sure that the result points to one conclusion rather than to another. 

The assurance of a conclusion may be of any degree of intensity from 
“doubtful” to “maybe” to “very likely” to “almost certain” to “practi-
cally certain.” Statistical procedures give more definite meaning to those
degrees of doubt and assurance. In this chapter, particularly, we shall 
be concerned with giving those concepts more exact meaning, so that we 
may be able to conclude whether certain outcomes of observations could 
perchance have arisen by accident or whether they point to something 
definitely not accidental. 


NULL HYPOTHESES 


General Meaning and Application of a Null Hypothesis.—In the two 
preceding chapters we had incidental references to null hypotheses. Here 
we will see a number of other applications of them. We very properly 
say “null hypotheses” in the plural, for there are many ways of stating a 
null hypothesis, depending upon the nature of the experimental problem. 
In very general terms, this kind of hypothesis merely states that in an 
experimental situation, or even in a nonexperimental situation, whenever 
things are enumerated or measured it is assumed for the sake of argument 
that nothing but the laws of chance are operating. An illustration from
experiments on extrasensory perception (ESP) is very suitable.


Suppose that an experiment with the Duke University ESP cards is 
properly set up to prevent the receiver from being influenced by any cues 
except possible telepathic stimulation. There are five different symbols 
on the cards, and in a thoroughly shuffled deck they should come up at 
random. As each one comes up and an experimenter reads it silently, 
the receiver makes his judgment. The card is returned to the deck, which 
is reshuffled, and the next one to be transmitted is selected. Starting 
with the hypothesis that there are no factors (including ESP) at work to 
determine the receiver’s responses, we should expect in the long run an 
average success of 20 per cent right, or 1 in 5. If any receiver gives an 
excess of correct responses over and above 20 per cent, we still have to 
determine whether this excess is significant or whether it could have 
occurred by the processes of sampling in his limited number of trials.
If the excess is one that could have happened as much as once in 10 times
(one sample of this size out of 10 such samples), we should still say that 
the null hypothesis is quite plausible. We could not say that it is cer- 
tainly established; but we would by no means give it up. Even if the 
excess over 20 per cent were one that could happen less than once in 
20 samples, though we should be more skeptical of the null hypothesis, 
we should be unjustified in completely rejecting it. When so large a dis- 
crepancy as we obtained could occur by sampling less than once in 100 
times, we customarily reject the hypothesis. We then say that it is
highly implausible. In making this decision, we run only one chance in
100 of rejecting the null hypothesis when it is in fact true.

But note that this does not automatically lead us to conclude that the 
alternative (ESP) hypothesis is true. It does tell us that something other 
than guesswork is going on, but it does not tell us what that “something 
other than guesswork” really is. If our experiment is designed so as to 
exclude all other possible factors than ESP in this case, then, having 
reduced the crucial experiment to an either-or proposition, i.e., either 
laws of chance or ESP, and having overwhelming indication that the 
chance hypothesis is wrong, we can accept the ESP hypothesis as true.
Unfortunately, the identification and control of all other factors favoring 
correct responses here is exceedingly difficult. But, in general, the estab- 
lishment of an experimental fact depends upon it. We shall see shortly 
how a statistical test of the null hypothesis can be made for this type of 
experiment; but first let us consider some simpler cases. 

Direct Determination of the Probable Validity of a Null Hypothesis.— 
Our first example is a simple psychophysical test situation. A student 
asserts that he can distinguish between two tones whose stimuli differ
by only 2 cycles per second. That is his hypothesis: that he possesses genuine
power to discriminate this difference in pitch. We doubt him, thus automatically
adopting a null hypothesis. Out of 6 trials, how many pairs
matically adopting a null hypothesis. Out of 6 trials, how many pairs 
should we require him to judge correctly before we give up our hypothesis 
and yield to his? Our hypothesis implies that when he judges the pair 
of tones he might just as well flip a coin and report “second higher” for 
“heads” and “second lower” for “tails.” We should expect him, by such 
guessing, to be correct half the time or 3 times out of 6. But how much 
of an excess over 3 correct judgments will it take to convince us that he is 
not merely guessing?

In a set of 6 trials, there are 7 possible outcomes, all the way from 6
down to 0 correct judgments. In Table 11.1 are listed all the 7 possibilities
and the probability of each event's occurring by random sampling
(chance). According to the probabilities involved in the situation, we
should expect only one “score” of 6 in 64 samples; we should expect 6
“scores” of 5, 15 “scores” of 4, and so on. These expectations are
according to the laws of probability.

TABLE 11.1.—EXPECTED OCCURRENCES AND PROBABILITIES OF SPECIFIED NUMBERS OF
CORRECT JUDGMENTS IN MAKING SIX JUDGMENTS AT RANDOM

Number of     Times expected     Probability     Probability of     Probability of
correct       in 64 sets         of this         as many or         as few or
judgments     of judgments       number          more occurring     less occurring

    6               1               1/64              1/64              64/64
    5               6               6/64              7/64              63/64
    4              15              15/64             22/64              57/64
    3              20              20/64             42/64              42/64
    2              15              15/64             57/64              22/64
    1               6               6/64             63/64               7/64
    0               1               1/64             64/64               1/64

The Use of Binomial Expansion.—A mathematical way of deriving the
probabilities for the seven scores is to apply the expansion of the binomial
$(1/2 + 1/2)^6$. In tossing a coin there are two possible, independent
outcomes: head or tail. The theoretical probability of a head occurring is
1/2 and the probability of a tail is also 1/2. The general expression for
the binomial is $(p + q)^n$, where n is the number of coins tossed. As in
the case of proportions, p + q = 1, and 1 to any power also equals 1,
so $(p + q)^n = 1.0$. The generalized binomial expansion is




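$$
\begin{aligned}
(p + q)^n = p^n &+ n p^{n-1} q + \frac{n(n-1)}{1 \times 2}\, p^{n-2} q^2
 + \frac{n(n-1)(n-2)}{1 \times 2 \times 3}\, p^{n-3} q^3 \\
&+ \frac{n(n-1)(n-2)(n-3)}{1 \times 2 \times 3 \times 4}\, p^{n-4} q^4 + \cdots + q^n
\end{aligned}
\qquad (11.1)
$$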


in which p and q can have any positive values so long as p + q = 1. 
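Applied to the problem with 6 coins (n = 6),

$$
\begin{aligned}
\left(\tfrac{1}{2} + \tfrac{1}{2}\right)^6
 &= \left(\tfrac{1}{2}\right)^6
 + 6\left(\tfrac{1}{2}\right)^5\left(\tfrac{1}{2}\right)
 + \frac{6 \times 5}{1 \times 2}\left(\tfrac{1}{2}\right)^4\left(\tfrac{1}{2}\right)^2
 + \frac{6 \times 5 \times 4}{1 \times 2 \times 3}\left(\tfrac{1}{2}\right)^3\left(\tfrac{1}{2}\right)^3 \\
 &\quad + \frac{6 \times 5 \times 4 \times 3}{1 \times 2 \times 3 \times 4}\left(\tfrac{1}{2}\right)^2\left(\tfrac{1}{2}\right)^4
 + \frac{6 \times 5 \times 4 \times 3 \times 2}{1 \times 2 \times 3 \times 4 \times 5}\left(\tfrac{1}{2}\right)\left(\tfrac{1}{2}\right)^5
 + \left(\tfrac{1}{2}\right)^6 \\
 &= \tfrac{1}{64} + \tfrac{6}{64} + \tfrac{15}{64} + \tfrac{20}{64} + \tfrac{15}{64} + \tfrac{6}{64} + \tfrac{1}{64}
\end{aligned}
$$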


If the seven fractions are summed the result is equal to 1. The proba- 
bilities coincide with those in Table 11.1 for the various scores. The 
numerators give the expected frequencies of the scores 0 to 6 inclusive, 
when the total number of scores is 64. 
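The terms of the expansion, and the cumulative columns of Table 11.1, can be generated directly. The following Python sketch is illustrative; the function name `binomial_terms` is hypothetical. Exact fractions are used so that the results can be compared with the table without rounding.

```python
from fractions import Fraction
from math import comb


def binomial_terms(n, p=Fraction(1, 2)):
    """Terms of (p + q)^n; the term at index s is the probability of s successes."""
    q = 1 - p
    return [comb(n, s) * p ** s * q ** (n - s) for s in range(n + 1)]


terms = binomial_terms(6)

# Cumulative columns of Table 11.1, indexed by number of correct judgments.
as_many_or_more = [sum(terms[s:]) for s in range(7)]
as_few_or_less = [sum(terms[: s + 1]) for s in range(7)]

for score in range(6, -1, -1):
    # Multiplying by 64 gives the expected frequency in 64 sets of judgments.
    print(score, terms[score] * 64, as_many_or_more[score], as_few_or_less[score])
```

Note that Python prints fractions in lowest terms, so 22/64, for example, appears as 11/32.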

Testing Deviations from Expected Values —In determining whether the 
student’s hypothesis about his acuity for pitch differences has much claim 
for acceptance, we are interested in how far his obtained score deviates 
from that to be expected by chance. A chance score in this situation 
would be 3 correct judgments out of 6. How much deviation from a score 
of 3 does he need to overthrow the null hypothesis? A score as high as 
6 would be expected one sixty-fourth of the time. One chance in 64 
would seem to be between the 5 per cent and 1 per cent levels so com- 
monly applied. But remember that these standards are applied to devi- 
ations in both directions from the mean. The chance for a score of 6 is 
equal to that for a score of 0, which deviates an equal amount in the 
opposite direction. The probability of either event occurring is the sum 
of the probabilities for the two separately, or 2/64, or 1/32. Thus, if 
the student obtained a score of 6, we would still have some confidence in 
his claim, even though the probability is not much beyond the 5 per cent 
level. 

An obtained score of zero would be interesting to interpret. From a
common-sense point of view one might argue that such a score is positive 
evidence for lack of ability. But note that from a statistical standpoint 
a score of 0 is just as significant as a score of 6. We should be just as
inclined to reject the null hypothesis as when the score was 6. But the
alternative inference would be different. Whereas a score of 6 would lead 
to some belief in real ability to judge differences in pitch, a score of zero 
would indicate something biasing the student in the direction of reversal 
of judgments. If this conclusion seems unreasonable, remember that 
with zero ability in this task a 6 is just as likely to occur as a 0, if judg- 
ments are not otherwise biased. An obtained score of 6 can be one of 
the events resulting from pure guessing. This should stress the impor- 
tance of setting a high confidence level as a basis for rejecting a null 
hypothesis. 

How about deviations of +2 or —2? Here we consider the odds of 
his having a score of 5 or higher or of 1 or lower. In Table 11.1 these 
probabilities are each 7/64, or combined, 14/64, or a little less than 1/4. 
Such an event would be less likely to occur than not, but the deviation 
would be much too small to be taken seriously. We would not reject 
the null hypothesis. A score as high as 4 or as low as 2 (which means 
all possibilities except a score of 3), each with a probability of 22/64 (see 
Table 11.1), gives a probability of 44/64, which means more often than
not. On the whole, we would conclude from this line of reasoning that a 
test involving only six judgments would not be very decisive even when 
all judgments were correct. We should want more than six trials and 
we could determine the number of correct judgments required to justify 
rejection of the null hypothesis by a procedure like that described. 
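The probabilities of deviations in either direction can be computed by summing the two tails. The sketch below is illustrative only; the function name `prob_deviation_at_least` is hypothetical, and pure guessing with a probability of 1/2 on each trial is assumed.

```python
from fractions import Fraction
from math import comb


def prob_deviation_at_least(n, d):
    """Chance probability that a score departs from n/2 by d or more, either way."""
    expected = Fraction(n, 2)
    favorable = sum(
        comb(n, score) for score in range(n + 1) if abs(score - expected) >= d
    )
    return Fraction(favorable, 2 ** n)


# Six judgments: deviations of 3, 2, and 1 from the expected score of 3.
for d in (3, 2, 1):
    print(d, prob_deviation_at_least(6, d))
```

For six judgments the three probabilities are 2/64, 14/64, and 44/64, as stated in the text.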

We consider next a case with a larger number of trials—a set of 10 true- 
false test items to which a student gives one of two alternative responses, 
one right and one wrong. How many more than 5 items must he do 
correctly for us to reject the hypothesis that he knows nothing about the 
subject matter of the examination and that he is merely guessing at 
random? The probabilities corresponding to the four highest scores are
given in Table 11.2. These probabilities are derived in the same manner
as those for the preceding problem, namely, from the application of the
binomial equation. In this case the equation is $(1/2 + 1/2)^{10}$.

TABLE 11.2.—EXPECTED OCCURRENCES AND PROBABILITIES OF SPECIFIED NUMBERS OF
CORRECT RESPONSES TO 10 TRUE-FALSE TEST ITEMS

Number of    Expected       Probability    Probability    Probability of      Probability of
correct      number right   of this        of this        a like deviation    a like deviation
responses    in 1,024       number by      number or      in opposite         in either
             sets           chance         higher         direction           direction

   10             1           1/1,024        1/1,024          1/1,024             1/512
    9            10          10/1,024       11/1,024         11/1,024            11/512
    8            45          45/1,024       56/1,024         56/1,024             7/64
    7           120         120/1,024      176/1,024        176/1,024            11/32

From Table 11.2 we see that a score as extreme as 10 (a deviation of 
5 from the expected or mean score of 5) could occur only once in 512 
attempts. A score of 10 almost certainly indicates some knowledge or 
ability measured by the test, though we do not know just how much. 
A score as extreme as 9 could occur 11 times in 512 attempts or about 
1 in 46 attempts. This would indicate probable knowledge or ability but 
not with very great assurance. A score of 8 could occur about once in 
9 attempts and is consequently not at all fatal to the null hypothesis— 
the hypothesis of no knowledge or ability in the area sampled by this
test.¹
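The same reasoning can be turned around to ask how high a score must be before the null hypothesis is rejected at a chosen level. The Python sketch below is illustrative; the function name `smallest_significant_score` is hypothetical, guessing with a probability of 1/2 per item is assumed, and deviations in both directions are counted, as in the text.

```python
from fractions import Fraction
from math import comb


def smallest_significant_score(n, alpha):
    """Lowest score above n/2 whose two-directional chance probability is
    no greater than alpha; None if even a perfect score does not qualify."""
    for score in range(n // 2 + 1, n + 1):
        upper_tail = sum(comb(n, s) for s in range(score, n + 1))
        # The lower tail is equal by symmetry when p = q = 1/2.
        either_direction = Fraction(2 * upper_tail, 2 ** n)
        if either_direction <= alpha:
            return score
    return None


five_per_cent = Fraction(5, 100)
one_per_cent = Fraction(1, 100)

print(smallest_significant_score(10, five_per_cent))
print(smallest_significant_score(10, one_per_cent))
print(smallest_significant_score(6, one_per_cent))
```

For the 10-item test a score of 9 is needed at the 5 per cent level and a score of 10 at the 1 per cent level; with only 6 trials no score reaches the 1 per cent level.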

Departures from Random Conditions—In applying such tests of the 
null hypothesis to any practical situation such as this, however, it must 
be kept in mind that we are assuming that in the event of complete 
ignorance the examinee will guess purely at random. Experience tends 
to show that in the absence of knowledge human beings do not always 
guess or respond at random. They exhibit patterns of responses or 
pattern habits. With biases such as this in the picture, hypotheses based 
upon chance distributions must be made with great caution and sometimes 
are precluded. The presence of bias cannot be easily detected, but one 
evidence of it would be a “significant” deviation in an unreasonable 
direction, as when in a guessing situation a statistically significant number
of wrong judgments or responses occurs. Goodfellow has shown in
connection with “experiments” on telepathy over the radio, for example,
that when an audience made five successive guesses of “black” versus “white”
there were a number of common sequence patterns.² Alternations occur
less frequently than one would expect by chance; runs are avoided; 
and certain initial responses may be favored, sometimes in response to an 
incidental cue that an experimenter might well overlook. 

The presence of such nonrandom effects is bothersome, but there are 
experimental controls that may help to prevent them. There is probably 
enough randomness under a wide range of behavior to make possible a
very profitable use of the statistical tests that depend upon it. 


1 For a discussion of the problems of testing whether “runs” of the same response
are of sufficient length to justify rejection of the null hypothesis, see Grant, D. A.
New statistical criteria for learning and problem solution in experiments involving
repeated trials. Psychol. Bull., 1946, 43, 272-282.

2 Goodfellow, L. D. The human element in probability. J. gen. Psychol., 1940, 23,
201-205.




Hypotheses Based upon the Normal Curve.—In the previous illustra- 
tions, we actually counted up the total number of possible outcomes and 
also the number of times certain outcomes would be expected, and from 
these we obtained directly the probabilities that the null hypothesis was 
incorrect. There are other instances, when the number of responses we
deal with is quite limited, in which a similar counting of cases can be 
done and the probability of extreme deviations from chance can be derived. 
When the number of possible outcomes is not small, however, this counting 
of cases, or even algebraic computations of permutations and combina- 
tions, is much less efficient than other methods that will be described next. 

In a certain elementary-psychology laboratory experiment, we have the
problem to determine whether students can perceive from photographs
whether or not a man has been convicted of crime. Pictures of 20 pairs
of men matched for certain qualities are exhibited, and the student judges
which of the two is the criminal. The null hypothesis calls for 10 correct
responses, provided that only random guessing accounted for the score.
How large an excess is indicative of actual perception or of something
other than chance?

FIG. 11.1.—Standard-score distance from the hypothetical mean of the integral scores
14, 15, and 16 (correct judgments out of 20) when each judgment has an even chance of
being right or wrong on the hypothesis of complete ignorance. [Normal curve of
frequencies on the score scale, with the mean at 10 and the points 13.5, 14.5, and 15.5
marked.]

To solve this problem, we do not resort to counting up the probabilities 
of as many as 20, 19, 18, etc., or more correct responses. Rather, we 
assume that each set of 20 judgments is a sample and that such samples 
would have a mean of 10, with a standard error equal to the SE
of a frequency, which equals $\sqrt{Npq}$ (see formula 9.19). We also assume
a normal distribution of the samples of frequencies. For this problem,
N is 20, p is .5, and q is .5. The $\sigma_f$ is therefore $\sqrt{20 \times .5 \times .5} = 2.236$.
The distribution of these frequencies is shown in Fig. 11.1, with a mean of
10 and a $\sigma$ of 2.236. We are now ready to ask about the probability of a
randomly determined score being as high as X or higher. For example, 
would a score of 14 be significantly in excess of the expected score of 10? 




At first thought, this excess is 4 units above the mean of the distribution. 
But remember that a score of 14 is customarily one that occupies the inter- 
val from 13.5 to 14.5. A score of “14 or above” in this case therefore 
takes in all the normal curve above the point 13.5. It is a different 
matter to ask what is the area under the normal curve above the point 
14.0 and to ask what is the area under the curve for a score of 14 or 
above. The deviation of the lower limit of this score from the mean is 
3.5 units. Dividing this deviation by $\sigma$, which is 2.236, we have a t equal
to 1.57.1 Going to the probability table (Table B) with this standard
score, we find the proportion of area above the point 13.5 to be .0583.
Remembering that a score of 6 could occur as often as one of 14, and the
probability of a score of 6 or below would also be .0583, the combined
probability for these two alternative events is .1166, or about 12 chances
in 100. A score of 15, which begins at 14.5, is 2.01$\sigma$ above 10, and the
probability of a chance score this high or higher is .0223. Combining
this probability with a like one for a chance score of 5 or below, we have
.0446. Such a deviation is just significant.

A score of 16 is 2.46$\sigma$ above the mean and has only about 15 chances in
1,000 of occurring by guesswork. If all secondary cues, i.e., cues not 
having to do with objective signs of criminality versus noncriminality in 
the photographs, were eliminated, we could conclude that the student 
who earns a score of 16 probably has the ability to make this kind of 
discrimination. If, however, we had obtained 1,000 scores and only 7 
(approximately) were this large, we should, on the basis of this much 
larger experience, revert to the null hypothesis. But when we are 
restricted to a single sample of 20 judgments, the statistical tests justify 
us in rejecting the hypothesis when the score is as high as 16. 
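The steps just described (measure from the lower limit of the score, divide by the standard error of the frequency, and read the tail area) can be sketched as follows. This is an illustration only: the function names are hypothetical, and the tail area is computed from the error function rather than read from Table B, so the last decimal place may differ slightly from the figures quoted in the text.

```python
from math import erfc, sqrt


def upper_tail_area(z):
    """Area under the unit normal curve above z."""
    return 0.5 * erfc(z / sqrt(2))


def chance_of_score_or_higher(score, n=20, p=0.5):
    """Normal approximation, measured from the lower limit of the score."""
    mean = n * p
    sigma = sqrt(n * p * (1 - p))      # SE of a frequency, sqrt(Npq)
    z = (score - 0.5 - mean) / sigma   # a score of X begins at X - .5
    return z, upper_tail_area(z)


for score in (14, 15, 16):
    z, one_tail = chance_of_score_or_higher(score)
    # Doubling the tail allows for an equally extreme deficiency below 10.
    print(score, round(z, 2), round(one_tail, 4), round(2 * one_tail, 4))
```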

How Large a Deviation Is Significant?—To return to the ESP problem, 
in 50 trials, when the probability of chance success is .20 and so the 
expected frequency is 10, the standard error of the frequency is 


$$\sqrt{50 \times .2 \times .8} = 2.83$$


We could now test the plausibility of the null hypothesis in the face of 
different numbers of correct responses in excess of 10. But it might be 
more to the point to ask how large a score it would take to be significant 
and how large a score to be very significant.2 


1 Remember that t is a standard measure, and in a normal distribution may be used
just as z is used. 

2 See discussion on p. 200 regarding the sampling distribution of p (which also applies 
to the distribution of f). 




To be significantly in excess of 10, a score of X or larger could happen 
by chance only 2.5 per cent of the time. What point on the score scale 
comes at such a position? From the table, the t corresponding to this
point is 1.96. This value times $\sigma$ is 1.96 × 2.83 units on the score scale.
This excess added to 10 gives us 15.5. Remembering that a score of 16
really begins at 15.5, we conclude that at least a score of 16 or higher is
required to be significant of anything over guesswork. To be very
significant, the tail probability is .005, t is 2.576, and the excess is 7.3. This
gives a point of 17.3 on the score scale. In terms of whole numbers,
it requires a score of 18 or better to be very significant and to cause us to
reject the null hypothesis. A score of 25 or better (above 24.5 on the
scale) is 5.12$\sigma$ above the mean, and there is less than one chance in a
million that so large an excess could occur by guessing alone. Such scores 
demand an explanation, but the explanation is not inevitably to be in
terms of ESP unless other hypotheses have been adequately rejected by
reason of rigorous and known control of experimental conditions.

FIG. 11.2.—Hypothetical sampling distributions corresponding to various hypotheses
concerning the population mean when the obtained or sample mean is 29.6. [Horizontal
axis: hypothetical population means of 30.0, 31.0, 32.0, 33.0, and 34.0, with the sample
mean marked at 29.6.]
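The two cutting points on the score scale can be recomputed as a check. The sketch below assumes Python's standard `statistics.NormalDist` for the normal deviate (in place of the printed table); the function name `significance_cutoff` is hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def significance_cutoff(n, p, tail_area):
    """Point on the score scale that leaves `tail_area` in the upper tail."""
    mean = n * p
    sigma = sqrt(n * p * (1 - p))
    z = NormalDist().inv_cdf(1 - tail_area)
    return mean + z * sigma


# ESP problem: 50 trials, chance success of .20 on each trial.
print(round(significance_cutoff(50, 0.2, 0.025), 1))  # significant
print(round(significance_cutoff(50, 0.2, 0.005), 1))  # very significant
```

The function returns only the point on the continuous scale; turning it into a whole-number score still requires remembering that a score of X begins at X − .5, as the text explains.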

Testing Different Hypotheses about the Population Mean.—It can now be 
shown, more appropriately than before, how, in the absence of other 
information, the obtained or sample mean is the most plausible value of 
the population mean. Let us return to the ink-blot test data used so
many times in previous chapters. The mean of 50 scores was 29.6 in
one sample, and the standard deviation was 10.45, from which the $\sigma_M$
was estimated to be about 1.49. Let us choose several possible values
that the population mean might have and in each case we will see how 
likely it is that a sample mean of 29.6 could then have arisen by random 
sampling. Figure 11.2 shows five hypotheses concerning the population 
mean: that it is, in turn, 34.0, 33.0, 32.0, 31.0, and 30.0. 

Consider, first, the hypothesis that is farthest from the sample mean, 
namely, a hypothetical population mean of 34.0. This calls for an 
assumed sampling distribution with a mean of 34.0 and a standard devi- 
ation of 1.5. A sample mean of 29.6 deviates 4.4 from this hypothetical 




mean. This deviation gives a t of 4.4/1.49 = 2.95. What is the
probability of a deviation as large as this occurring by random sampling? It is
twice the area under the tail of the normal curve (since this sample
qualifies as a “large” one). From Table B, we find the area in one tail
beyond a z of 2.95 to be .0016. The area in both tails is .0032, from
which we conclude that in only about 32 cases in 10,000 could a deviation
as large as 4.4 units occur by random sampling. We therefore reject 
this hypothesis with considerable confidence. 

The next hypothesis is for a population mean of 33.0, which gives a
deviation of 3.4 for the sample mean and a t of 2.28. The area under
the unit normal curve beyond this point is .0113. Twice this area is
.0226. We can reject this hypothesis with only about 2 chances in 100
of being wrong. If we hypothesize a population mean of 32.0, the
deviation is 2.4, t is 1.61, and the tail area is .0537. The chances for such a
random deviation are more than 10 in 100. If we hypothesize a
population mean of 31.0, the deviation is 1.4, t is .94, and the area beyond
this t (in both directions) is .348. There are 348 chances in 1,000 that
so large a deviation as 1.4 units could occur by random sampling. We 
do not reject the hypothesis that the population mean is 31.0. To go 
one step closer to the sample mean with our hypothesis, let us choose 
30.0 as the population mean. This leads to an area of .394 in one “tail” 
and .788 in the two. The odds are 788 in 1,000 that a deviation as large 
or larger than that of 29.6 from 30.0 could have occurred by random 
sampling. Thus, as we approach the sample mean closer and closer with 
our hypothetical population mean, the odds keep increasing, which indi- 
cates that the plausibility of the hypothesis increases. The maximum 
plausibility would be reached when the hypothesis is 29.6, in other words, 
when it coincides with the sample mean. 
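The five hypotheses can be run through the same arithmetic in a few lines. This is a sketch under the text's own assumptions (a sample mean of 29.6, a standard error of the mean of 1.49, and a normal sampling distribution); the function name is illustrative, and the areas come from `statistics.NormalDist` rather than Table B.

```python
from statistics import NormalDist

SAMPLE_MEAN = 29.6
SE_MEAN = 1.49


def two_tailed_probability(hypothetical_mean):
    """Chance of a deviation this large or larger, in either direction."""
    t = abs(SAMPLE_MEAN - hypothetical_mean) / SE_MEAN
    return t, 2 * (1 - NormalDist().cdf(t))


for mu in (34.0, 33.0, 32.0, 31.0, 30.0):
    t, prob = two_tailed_probability(mu)
    print(mu, round(t, 2), round(prob, 4))
```

The printed probabilities rise steadily as the hypothetical mean approaches 29.6, which is the point of the paragraph above.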

In this discussion, we have omitted reference to the customary 5 per
cent and 1 per cent levels. We could choose hypothetical population
means such that the deviation of the sample mean from them would give
t’s at those particular levels. These deviations are known as fiducial
limits. They mark off the limits of all hypotheses that give less than
5 per cent or 1 per cent degrees of confidence of rejection. All hypotheses
of population means differing more than about 2.9 (which is 1.96 times $\sigma_M$)
from the sample mean (in the ink-blot data) can be rejected with at least
the 5 per cent degree of confidence. There are less than 5 chances in 100
of being wrong in so doing. All hypothetical means differing more than
3.8 can be rejected with confidence at the 1 per cent level. These
interpretations of $\sigma_M$ seem roundabout, but in all logical accuracy this is how
they are best made, though they are not so simple as those in Ch. 9.
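As a small worked check, the limits themselves can be computed from the sample mean and its standard error. The sketch assumes the ink-blot values given in the text (29.6 and 1.49); `fiducial_limits` is a hypothetical helper name.

```python
from statistics import NormalDist


def fiducial_limits(sample_mean, se_mean, level):
    """Hypothetical means farther out than these limits are rejected at `level`."""
    z = NormalDist().inv_cdf(1 - level / 2)   # 1.96 for .05, 2.576 for .01
    margin = z * se_mean
    return sample_mean - margin, sample_mean + margin


for level in (0.05, 0.01):
    low, high = fiducial_limits(29.6, 1.49, level)
    print(level, round(low, 1), round(high, 1))
```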




How Large a Sample Is Necessary for Significant Deviations from Null 
Hypotheses?—We have already raised and answered the kind of question 
that asks for a given size of sample how large a discrepancy is necessary 
for significant and very significant deviation from a null hypothesis. 
Here we face a little different kind of question. We let our relative 
excess remain constant and ask how large N must be in order for that 
same size of discrepancy to reach the critical levels. 

In a survey like the Gallup poll, for example, one would constantly be 
faced with the question of how large a sample to obtain; how many inter- 
views to make; how many responses to a stimulus to record. That mere 
numbers in a sample as such are not sufficient to guarantee predictive 
ability was brought home to us decisively by the unhappy Literary Digest 
poll of 1936. Though the votes sampled ran into the millions, the voters 
who really determined the outcome of the presidential election were not 
adequately represented in the sample. A good poll sees to it that every
kind of group of voters, where group differences count at all, is
proportionately represented in the poll. When this is accomplished, it is
surprising to the uninformed person how small a total sample can yield a
valid predictive index. In other words, it is not so much enormous num- 
bers that count as how the sample is made up. 

Let us assume that our sample is properly made up with good
representation.1 Let us assume an issue where majority vote is decisive. Our
null hypothesis is then 50 per cent or a proportion equal to .50. We ask
first how large a sample is needed to give us confidence that an obtained
vote of 55 per cent in favor of the proposition means a majority sentiment
in that direction and did not occur by random sampling from a population
that is on the fence. If a discrepancy of as much as 5 per cent is to be
significant in our accepted meaning of the word, 5 per cent must deviate
as much as 1.96$\sigma$ from the mean of a normal distribution. In terms of
proportions, the deviation is .05; how large must $\sigma_p$ be? Obviously it
must be such that .05 is 1.96 times $\sigma_p$. $\sigma_p$ is therefore equal to .05/1.96,
which equals .0255. The formula we need is


$$N = \frac{pq}{\sigma_p^{2}} \qquad \text{(Size of sample needed for significant deviation)} \tag{11.2}$$

We know p and q and $\sigma_p$ already. Substituting them in the equation,
we have

$$N = \frac{.5 \times .5}{.0255^{2}} = \frac{.25}{.00065025} = 384$$

to the nearest whole number. It is therefore a 19 to 1 bet that, when a
vote comes out with 55 per cent in favor of an issue in a sample of 384,
the population sampled is not evenly divided on the question.

1 For the case of stratified sampling that is usually applied in public-opinion polling,
modifications in line with standard-error formulas that fit that situation should be
applied (see Ch. 9) rather than the general one for completely random sampling that is
illustrated here.

But where much is at stake, we should not be satisfied with these odds 
against the null hypothesis. We might ask how many votes need to be 
sampled to assure us of a very significant deviation. In this case, the 
excess of .05 must be at 2.576$\sigma$ from the mean. The $\sigma_p$ must be .05/2.576,
which equals .0194. Applying formula (11.2) to determine N, we have

$$N = \frac{.5 \times .5}{.0194^{2}} = \frac{.25}{.00037636} = 664$$


Thus, in a sample of 664 interviewees, a majority vote of 55 per cent 
would be regarded as very significant. The odds would be 99 to 1 that 
the sentiment of the population sampled is not evenly divided on the 
issue. And since the deviation is in the direction favoring the issue, we 
strongly expect future outcomes to be in the same direction, but we do 
not know by how much. 

The sizes of samples just found are surprisingly small in view of the 
enormous populations that vote on national issues and whose sentiment 
they may be expected to estimate. The reason is that we have allowed a 
rather wide margin of .05 as the deviation from the null hypothesis. In
dealing with more vital issues, where close elections are concerned, excesses of
.01 or less may be decisive. If we are interested in the sizes of sample
required to give significant and very significant indications when the vote
is .51 to .49, the SE of the proportion must be one-fifth as large as it
was for a .55 to .45 division. If $\sigma_p$ is one-fifth as large, $\sigma_p^2$ is one
twenty-fifth as large. In this particular problem, the numbers to be substituted
in formula (11.2) are now the same except that the denominator is one
twenty-fifth of its former size. This makes N twenty-five times as large
as before.

For a deviation of .01 to be significant now, N must be 9,600 and to
be very significant it must be 16,600, these numbers being 25 times 384
and 664 respectively. Samples of this size would give us great assurance,
granting random sampling, that the sentiment is in the direction
indicated. On many issues, of course, the sentiment is more unevenly
balanced than .55 and .45. And, again, when we are interested in significance
of changes in sentiment, we have a revision of our problem, for then we 
are dealing with differences among proportions, a kind of problem to 
which we now turn. 
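Formula (11.2) lends itself to a short calculation. The sketch below assumes completely random sampling and takes the normal deviates 1.96 and 2.576 directly from the text; the function name is hypothetical. Because the arithmetic is carried at full precision instead of with the rounded values .0255 and .0194, the results differ by a unit or so from the 384 and 664 worked out above, and the figures for a .01 excess differ slightly from the rounded 9,600 and 16,600.

```python
def sample_size_needed(excess, z, p=0.5):
    """N at which `excess` over the null proportion p lies z sigma units from it."""
    sigma_p = excess / z            # required standard error of the proportion
    return p * (1 - p) / sigma_p ** 2


for excess in (0.05, 0.01):
    for z in (1.96, 2.576):
        print(excess, z, round(sample_size_needed(excess, z), 1))
```

Cutting the excess to one-fifth multiplies the required N by 25, whatever level of significance is chosen.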




CHI SQUARE


Consider the data in Table 11.3, where we have a comparison of two 
samples; one is of 206 young American males who when they were in
school had been regarded as feebleminded in terms of IQ. Their IQ’s
were in the range 60-69. The other group is of 206 men of similar age
(in the twenties) of IQ’s near 100.1 These two groups had been compared
with respect to a number of variables, one of which was marital status. 
At the time the study was made, the proportions married in the two 
groups were .539 and .408 for the normal and feebleminded groups, 
respectively. Is this difference significant? 

Chi Square as a Test of a Null Hypothesis.—To answer the question 
just asked we might resort to the t test described for such a purpose in
Ch. 9. We have an alternative procedure in the statistic known as chi
square. It can be applied to this problem and also to a wider range of
problems where a t test cannot be made. We will illustrate it first with
this simple case. For comparison with the t test, it might be said that
the obtained difference (.131) is 2.66 times its standard error, indicating a
confident rejection of the hypothesis of no difference.


Besides assuming no real difference in proportion married, we can
formulate a null hypothesis in another way to fit the chi-square approach. 
We make the same basic assumption in both cases: we adopt the idea 
that the two groups arose by random sampling from the same population. 
We then ask the question, “If this be true, how likely is it that the 
distribution of cases like those obtained could depart as much as they do 
from a random or chance distribution?” The four frequencies, in the 
four cells of Table 11.3, are 111, 84, 95, and 122. There seems to be 
some tendency for a concentration of cases in two cells: married-normal
and unmarried-feebleminded. On the face of it, this looks like a mean- 
ingful departure from a random distribution. 

If the distribution were random, what would it look like? We must 
determine the answer to this question, for that is the distribution called 
for by the null hypothesis. We do this entirely from the marginal totals, 
i.e., the sums of rows and of columns. We take these values to be fixed,
not having any better information about the genuine proportions of
married versus unmarried and of feebleminded versus normal in the
general population. Actually, we are not very much concerned about
those proportions. We are abstracting two variables for study, marital
status and intelligence, and we are interested in those qualities as such.

1 From Baller, W. R. A study of the present status of adults who were mentally 
deficient. Genet. Psychol. Monogr., 1936, 18, 165-244.




The experiment naturally brings about certain restrictions in order to 
exert certain experimental controls. It is within the framework of the 
investigation that we are talking about a population. This fact is the 
logical justification for accepting the marginal sums as fixed. Within 
those limitations there is considerable freedom of variability. 

The random distribution which will describe the null hypothesis is com- 
posed of four cell frequencies corresponding to those already mentioned: 
111, 84, 95, and 122. We use the symbol $f_o$ to stand for these observed
frequencies and the symbol $f_e$ to stand for the expected frequencies. How
shall we find the expected frequencies? 

TABLE 11.3.—A COMPARISON OF MEN OF NORMAL IQ WITH FEEBLEMINDED MEN WITH
RESPECT TO MARITAL STATUS

Marital status      Normal    Feebleminded    Both
Married              111          84           195
Unmarried             95         122           217
Total                206         206           412


TABLE 11.4.—THE EXPECTED NUMBERS OF MARRIED AND UNMARRIED MEN IN THE
NORMAL AND FEEBLEMINDED GROUPS HAD THERE BEEN NO DIFFERENCE
BETWEEN THE TWO

Marital status      Normal    Feebleminded    Both
Married              97.5        97.5          195
Unmarried           108.5       108.5          217
Total               206         206            412


Computation of Chi Square in a Contingency Table.—Reference to
Table 11.3 shows that of the total sample of 412, 195 were married and
217 were not. The proportions are .4733 and .5267, respectively. These
proportions we may take to describe the (assumed) single population.
By random sampling, each group, normal and feebleminded, should show 
the same proportions—.4733 married and .5267 unmarried. These pro- 
portions of 206 (for the normal group) give 97.5 and 108.5 married and 
unmarried persons, respectively. These are products (206 X .4733) and 
(206 X .5267), respectively. Since the feebleminded group also num- 
bered 206, it would be expected to have the same frequencies. The 
entire set of expected frequencies is given in Table 11.4. If we add the 
columns and rows we find that the sums are the same as for the observed 
frequencies. It is always well to check one’s work in this manner. 
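The reasoning of this paragraph (take the proportions married and unmarried in the whole sample, then apply them to each group) can be sketched as follows. The layout of the table as rows of marital status and columns of groups, and the function name, are assumptions made for the illustration.

```python
def expected_by_proportion(observed):
    """Expected frequencies: each group's size times the over-all row proportion."""
    n = sum(sum(row) for row in observed)
    group_sizes = [sum(col) for col in zip(*observed)]    # column sums (206, 206)
    row_proportions = [sum(row) / n for row in observed]  # .4733 and .5267
    return [[size * prop for size in group_sizes] for prop in row_proportions]


observed = [[111, 84],    # married:   normal, feebleminded
            [95, 122]]    # unmarried: normal, feebleminded
expected = expected_by_proportion(observed)
for row in expected:
    print([round(f, 1) for f in row])

# The check recommended in the text: the sums must match those of the observed table.
print([round(sum(row), 1) for row in expected])
print([round(sum(col), 1) for col in zip(*expected)])
```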




TABLE 11.5.—DISCREPANCIES BETWEEN OBTAINED AND EXPECTED FREQUENCIES IN
TABLES 11.3 AND 11.4

Marital status      Normal    Feebleminded
Married              13.5       -13.5
Unmarried           -13.5        13.5


TABLE 11.6.—THE CELL SQUARE CONTINGENCIES FOR THE COMPUTATION OF CHI
SQUARE RELATIVE TO THE STUDY OF MARITAL STATUS AND INTELLIGENCE

Marital status      Normal    Feebleminded    Both
Married              1.87        1.87         3.74
Unmarried            1.68        1.68         3.36
Total                3.55        3.55         7.10


Computing Expected Cell Frequencies.—In a contingency table of any 
number of rows and columns, the principles of computing the expected 
cell frequencies can be illustrated by the limited 3 × 3 table shown in
Table 11.7. Let the f’s with double subscripts stand for the obtained
frequencies. The sums of the rows are symbolized by $\Sigma f_a$, $\Sigma f_b$, $\Sigma f_c$, etc.,
and the sums of columns by $\Sigma f_1$, $\Sigma f_2$, $\Sigma f_3$, etc. The expected frequency
for any cell in row r and column k can be found by the formula

$$f_e = \frac{(\Sigma f_r)(\Sigma f_k)}{N} \qquad \text{(Expected frequency for a cell in row $r$ and column $k$)} \tag{11.3}$$


TABLE 11.7.—SCHEMA AND SYMBOLS FOR COMPUTATION OF EXPECTED CELL
FREQUENCIES IN A CONTINGENCY TABLE

                              Columns
Rows                   1           2           3           Sums of rows
A                   $f_{a1}$    $f_{a2}$    $f_{a3}$       $\Sigma f_a$
B                   $f_{b1}$    $f_{b2}$    $f_{b3}$       $\Sigma f_b$
C                   $f_{c1}$    $f_{c2}$    $f_{c3}$       $\Sigma f_c$
Sums of columns     $\Sigma f_1$  $\Sigma f_2$  $\Sigma f_3$     N

Let $\Sigma f_r$ stand for a sum of any row, e.g., $\Sigma f_a$, $\Sigma f_b$, . . . , etc.
Let $\Sigma f_k$ stand for a sum of any column, e.g., $\Sigma f_1$, $\Sigma f_2$, . . . .




Thus, the expected frequency corresponding to $f_{b3}$ would be derived from
the product $(\Sigma f_b)(\Sigma f_3)$ divided by N. Hence, the expected frequency
for row 1 and column 2 of Table 11.4 would be equal to

$$\frac{(195)(206)}{412} = \frac{40170}{412} = 97.5$$


Computing Cell Discrepancies.—Having the expected frequencies $f_e$, we
now ask whether the observed frequencies $f_o$ deviate from them sufficiently
to cause us to reject the hypothesis of no difference. For each of the
four cells of the table, we determine the discrepancy $f_o - f_e$. These
discrepancies are listed in Table 11.5. It will be seen that except for algebraic
sign they are all numerically the same. This outcome will be true of all 
fourfold tables of frequencies of this sort, whether the two groups com- 
pared have the same total numbers of cases or not. This fact can be 
used to give us short cuts in computation, as we shall see later. 

The Cell Square Contingencies.—In the solution of chi square, we square 
each discrepancy, divide by the corresponding $f_e$, and sum all the ratios.
The sum is chi square. In terms of a formula,

$$\chi^2 = \sum \left[ \frac{(f_o - f_e)^2}{f_e} \right] \qquad \text{(General formula for chi square)} \tag{11.4}$$

where the symbols have been explained above. Each cell provides a
ratio of $(f_o - f_e)^2$ to $f_e$, which ratio has been called the cell square
contingency. This is merely a convenient name, at present, but later (Ch. 14)
it will be related to prediction procedures. For now, it can be said that
chi square is the sum of the cell square contingencies in a contingency
table (see Table 11.6).

The square of the discrepancy 13.5 is 182.25. In two cells, this is to
be divided by 97.5, which yields 1.87. In the other two cells it is to be
divided by 108.5, which yields 1.68. Summing twice 1.87 and twice 1.68,
we have 7.10 as the value of $\chi^2$.
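The whole computation, from observed frequencies to chi square, fits in one short function. This is a minimal sketch of formula (11.4) for a table of any size, with expected frequencies taken from the row and column sums; the function name is illustrative.

```python
def chi_square(observed):
    """Sum of (f_o - f_e)^2 / f_e over all cells of a contingency table."""
    row_sums = [sum(row) for row in observed]
    col_sums = [sum(col) for col in zip(*observed)]
    n = sum(row_sums)
    total = 0.0
    for r, row in enumerate(observed):
        for k, f_o in enumerate(row):
            f_e = row_sums[r] * col_sums[k] / n   # expected frequency for the cell
            total += (f_o - f_e) ** 2 / f_e       # cell square contingency
    return total


marital = [[111, 84],   # married:   normal, feebleminded
           [95, 122]]   # unmarried: normal, feebleminded
print(round(chi_square(marital), 2))
```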

Interpretation of a Chi Square.—The number 7.10 stands for the total
amount of discrepancy between hypothesis and observation. Chi square
can be small enough to allow us to accept the null hypothesis or to accept
it with some doubt, or it can be large enough to lead us to reject the
hypothesis with moderate or with positive assurance. Like a t
ratio, it can be interpreted as being significantly or very significantly
large, i.e., of being so large that sampling alone could account for the
results only once in 20 times, or once in 100 times, as the case may be.

Degrees of Freedom.—Tables of chi square (see Table E, Appendix)
enable us to decide the matter. But we must know the number of degrees
of freedom df before we can use the table. In a fourfold table such as we
have here, there is only 1 degree of freedom. 

Let us see how it is that we have only 1 degree of freedom. Remember 
that we have taken the row and column sums to be fixed. This injects con- 
siderable rigidity into a contingency table. The general rule applying to 
all contingency tables regardless of numbers of rows and columns is 
that the degrees of freedom equal the product of the number-of-rows-minus-one
and the number-of-columns-minus-one. If there are r rows
and k columns, both r and k being greater than 1,

$$df = (r - 1)(k - 1) \qquad \text{(Number of degrees of freedom in a contingency table of $r$ rows and $k$ columns)} \tag{11.5}$$

In a 2 × 2 table, applying the formula, we would expect 1 degree of
freedom. This is made reasonable by the following logic. Once we have
chosen a single cell frequency, with the row and column sums being what
they are, all the other cell frequencies are determined; they are not free to vary.
This is reflected, also, by the fact that there is only one value for the cell
discrepancies.

FIG. 11.3.—Sampling distribution of chi square for various degrees of freedom. (After
Lewis, D. Quantitative methods in psychology. Iowa City: The author, 1948.)
[Horizontal axis: scale of chi square, 0 to 20; vertical axis: frequency.]

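The logic of the single degree of freedom can be made concrete: with the marginal sums fixed, naming one cell of a fourfold table settles the other three. The sketch below is only an illustration; both function names are hypothetical.

```python
def degrees_of_freedom(n_rows, n_cols):
    """Formula (11.5): (r - 1)(k - 1)."""
    return (n_rows - 1) * (n_cols - 1)


def complete_fourfold(first_cell, row_sums, col_sums):
    """Fill in a 2 x 2 table from its top-left cell and the fixed marginal sums."""
    a = first_cell
    b = row_sums[0] - a      # rest of the first row
    c = col_sums[0] - a      # rest of the first column
    d = row_sums[1] - c      # the last cell follows from the second row sum
    return [[a, b], [c, d]]


print(degrees_of_freedom(2, 2))
print(complete_fourfold(111, row_sums=(195, 217), col_sums=(206, 206)))
```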
The Sampling Distribution of Chi Square.—The importance of degrees 
of freedom can be seen in connection with Fig. 11.3, which shows the 
sampling distributions of chi square for a number of different degrees of
freedom ranging from 1 to 10. It is because of these known distributions
that the tables for interpreting a chi square could be constructed. In
general, distributions of this statistic are positively skewed, and the
smaller the number of degrees of freedom, the greater the skewness. As the
number of degrees of freedom becomes large, this distribution approaches the normal curve
in form. The distribution with 10 degrees of freedom is apparently not 
far from normal. 

Use of the Chi-square Tables—Is our chi square of 7.10 significant? 
Table E shows that when df = 1 the largest chi square given is 6.635. 
Right above this is the probability of .01, which means that a chi square
as large as 6.635 or larger could occur by chance alone only once in 100 
times. Our chi square of 7.10 is larger than 6.635 and therefore could 
occur in the same manner less than once in 100 times. We therefore 
regard it as very significant and reject the hypothesis of no difference 
between the two groups. 
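For 1 degree of freedom the tabled probabilities can be reproduced without the table, because a chi square with 1 degree of freedom is the square of a unit normal deviate. The sketch below rests on that fact and on the standard error function; it applies to df = 1 only, and the function name is an illustrative choice.

```python
from math import erfc, sqrt


def chi_square_probability_1df(chi_sq):
    """Chance of a chi square this large or larger when df = 1."""
    # P(z^2 > x) = P(|z| > sqrt(x)), the two-tailed normal area beyond sqrt(x).
    return erfc(sqrt(chi_sq / 2))


for value in (3.841, 6.635, 7.10):
    print(value, round(chi_square_probability_1df(value), 4))
```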

Relation of Chi Square to t.—When there is 1 degree of freedom in a
contingency table, chi square is equal to t², or t is equal to chi, the square
root of chi square. The square root of our chi square obtained for the
marital data, namely 7.10, is equal to 2.66. This checks exactly with
the t that was reported in an earlier paragraph. A t test and a chi-square
test of the same statistics will therefore lead to the same inferences when
there is 1 degree of freedom.
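The agreement between the two tests can be verified numerically on the marital data. The sketch assumes that the standard error of the difference is based on the pooled proportion of the two groups combined (the formula of Ch. 9 is not reproduced here, so this is an assumption); the function names are hypothetical.

```python
from math import sqrt


def t_from_proportions(count_1, n_1, count_2, n_2):
    """Difference between two proportions divided by its (pooled) standard error."""
    pooled = (count_1 + count_2) / (n_1 + n_2)
    standard_error = sqrt(pooled * (1 - pooled) * (1 / n_1 + 1 / n_2))
    return (count_1 / n_1 - count_2 / n_2) / standard_error


def chi_square(observed):
    """Formula (11.4) with expected frequencies from the marginal sums."""
    row_sums = [sum(row) for row in observed]
    col_sums = [sum(col) for col in zip(*observed)]
    n = sum(row_sums)
    return sum(
        (f_o - row_sums[r] * col_sums[k] / n) ** 2 / (row_sums[r] * col_sums[k] / n)
        for r, row in enumerate(observed)
        for k, f_o in enumerate(row)
    )


t = t_from_proportions(111, 206, 84, 206)
chi_sq = chi_square([[111, 84], [95, 122]])
print(round(t, 2), round(sqrt(chi_sq), 2))
```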

Chi Square When Frequencies Are Small.—When applying the chi-square
test to a problem with 1 degree of freedom, particularly, it is
important to exercise caution when samples are small. When any of 
the theoretical cell frequencies are small (some say as small as 50; in the 
opinion of the author 25 might be a more realistic limit), it is recommended 
that a correction known as Yates’s correction for continuity (correction for 
discontinuity would be a better name for it) be applied. This correction 
consists of reducing by .5 each of the obtained frequencies that are greater 
than expected and of increasing by .5 each of the frequencies that are less 
than expected. This has the effect of reducing the amount of each of the 
discrepancies between observed and expected frequencies by .5 (without 
changing the expected frequencies). It is especially important to apply 
such a correction when the chi square turns out (without correction) just
above the customary confidence limits. The effect of the correction is to
reduce the size of chi square. The correction is needed here for the same
reason it was suggested in connection with a distribution of scores (see
the section on hypotheses based upon the normal curve, p. 267). There it
was pointed out that with a limited number of
units in the range, testing whether a given score was significant involved
reducing a deviation by .5 unit in order to get to the point at its lower
limit. Frequencies, like scores, increase by whole units, whereas
distributions of sampling statistics like t and chi square are on a continuous
scale. When scores or frequencies are numerically large, a change of
.5 unit is relatively unimportant, but when the number of either is limited,
a change of .5 unit amounts to quite a percentage change.

There are lower limits to utilizable frequencies, even when Yates’s cor- 
rection is applied. Some authors say that a chi square should not be 
computed if any theoretical frequency is less than 10. Others, more 
generous, would compute chi square even when a theoretical cell
frequency is as low as 2. A realistic limit is 5. If there are more than four
cells in the contingency table, there is always the possibility of combining
cells so as to avoid very low frequencies. An illustration or two of this will
be shown later (see Table 11.10). There is probably nothing to be gained
by applying Yates’s correction when there is more than 1 degree of freedom,
and under these conditions the correction becomes complicated.1

An Example of Yates’s Correction.—In a public-opinion poll conducted 
not so many years ago, sentiment was sampled concerning attitudes 
toward radio newscasts.2 Some 43 interviewees in one sample were asked 
the question, “Do you find it easier to listen to news than to read it?” 
The sample had been stratified into a higher and a lower socioeconomic 
status, 19 being in the former and 24 in the latter. The numbers respond- 
ing “Yes” to the question in the two groups were 10 and 20, respectively. 
The problem to be investigated is whether there is a real difference 
between the two groups in their opinions on the question. 

The data have been arranged in the usual manner in Table 11.8. 
The frequencies are all small enough, certainly, to call for Yates’s cor- 
rection. One obtained frequency is less than 5, but we pay more atten- 
tion to the theoretical frequencies before we may decide to give up the 
idea of a chi-square test. The expected frequencies are given in Table 
11.8. All of them exceed 5, so we may proceed with the test. 

Let us carry through the test first without the correction and then with 
it to see what difference it may make in our conclusions. Without the 
correction, the cell deviations would all equal 3.26. This squared is 
10.63. Applying formula (11.4) and solving, we find that chi square 
equals 4.76, which is significant between the 5 per cent and 1 per cent 
levels. With the correction applied, the cell deviation in all cells is 2.76
rather than 3.26, and its square is 7.62. With this solution, chi square

¹ Kelley, T. L. Fundamentals of statistics. Cambridge: Harvard University Press,
1947. P. 330.

² From Cantril, H. The role of the radio commentator. Publ. Opin. Quart., 1939,
3, 654-662.




becomes 3.42, which fails to reach the 5 per cent level of significance. One
would have much more confidence in the interpretation of the second
outcome than the first. Not always will the correction make as much
difference and not always will it lead to a drastic change in the conclusion.
When in doubt, use it.

TABLE 11.8.—COMPUTATION OF CHI SQUARE FOR RESPONSES OF TWO SOCIOECONOMIC
GROUPS TO PREFERENCE FOR RADIO NEWS TO READING A NEWSPAPER

              Obtained frequencies         Expected frequencies
              Socioeconomic group          Socioeconomic group
Response      Higher    Lower    Both      Higher    Lower    Both
Yes             10        20      30       13.26     16.74     30
No               9         4      13        5.74      7.26     13
Both            19        24      43       19        24        43

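The two computations for the opinion-poll data can also be checked by machine. The following is a minimal sketch in Python and is not part of the original text; the function name `chi_square_2x2` and its `yates` argument are illustrative assumptions. Because it carries the expected frequencies unrounded, its results (about 4.74 and 3.40) differ in the second decimal from the hand computation, which works with rounded values.

```python
def chi_square_2x2(observed, yates=False):
    """Chi square for a 2 x 2 table of obtained frequencies.

    `observed` is [[a, b], [c, d]]; expected frequencies are derived
    from the marginal totals. With yates=True each cell discrepancy is
    reduced by .5 before squaring (correction for continuity).
    """
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    chi_square = 0.0
    for i, row in enumerate(observed):
        for j, f_o in enumerate(row):
            f_e = row_totals[i] * col_totals[j] / n
            discrepancy = abs(f_o - f_e)
            if yates:
                discrepancy = max(discrepancy - 0.5, 0.0)
            chi_square += discrepancy ** 2 / f_e
    return chi_square


# Opinion-poll data: first row is the "Yes" responses, second row the
# remaining responses; columns are the higher and lower groups.
poll = [[10, 20], [9, 4]]
uncorrected = chi_square_2x2(poll)
corrected = chi_square_2x2(poll, yates=True)
print(round(uncorrected, 2), round(corrected, 2))
```

The uncorrected value lies above 3.841, the chi square required at the 5 per cent level with 1 degree of freedom, and the corrected value lies below it, which is the change of conclusion described above.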
Other Ways of Computing Chi Square in a 2 × 2 Table.—In a fourfold-table
problem, since the discrepancy is the same for all cells, the
formula for chi square can be written

$$\chi^2 = d^2 \sum \frac{1}{f_e}$$
(Chi square in a 2 × 2 contingency table) (11.6)


That is, chi square equals the common discrepancy squared times the
sum of the reciprocals of the four $f_e$’s. As applied to the marital-status
problem,

$$
\begin{aligned}
\chi^2 &= 13.5^2 \left( \frac{1}{97.5} + \frac{1}{97.5} + \frac{1}{108.5} + \frac{1}{108.5} \right) \\
&= 182.25(.01026 + .01026 + .00922 + .00922) \\
&= 182.25 \times .03896 \\
&= 7.10
\end{aligned}
$$


If the data are arranged in a 2 × 2 table as shown in Table 11.9,
another convenient formula for the computation of chi square is

$$\chi^2 = \frac{N(ad - bc)^2}{(a + b)(a + c)(b + d)(c + d)}$$
(Alternative formula for chi square in a four-cell, 2 × 2 table) (11.7)


Applied to the opinion-poll data,

$$
\begin{aligned}
\chi^2 &= \frac{43[(10)(4) - (20)(9)]^2}{(30)(19)(24)(13)} \\
&= \frac{43(40 - 180)^2}{(570)(312)} \\
&= \frac{842,800}{177,840} \\
&= 4.74
\end{aligned}
$$


The answer is close enough to that obtained by the other method to be 
regarded as checking. Note, however, that this is the result without 
correction for continuity. 
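As a check on the agreement between formulas (11.6) and (11.7), the short Python sketch below applies both to the same four cell frequencies. It is an illustration only; the function names are assumptions made here, and the cell letters a, b, c, d follow the symbolic arrangement used with formula (11.7).

```python
def chi_square_common_discrepancy(a, b, c, d):
    """Formula (11.6): squared common discrepancy times the sum of 1/f_e."""
    n = a + b + c + d
    expected = [
        (a + b) * (a + c) / n,
        (a + b) * (b + d) / n,
        (c + d) * (a + c) / n,
        (c + d) * (b + d) / n,
    ]
    discrepancy = a - expected[0]  # the same size in every cell of a 2 x 2 table
    return discrepancy ** 2 * sum(1 / f_e for f_e in expected)


def chi_square_cross_products(a, b, c, d):
    """Formula (11.7): N(ad - bc)^2 over the product of the four marginal totals."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (a + c) * (b + d) * (c + d))


# Opinion-poll data: a = 10, b = 20, c = 9, d = 4.
by_discrepancy = chi_square_common_discrepancy(10, 20, 9, 4)
by_cross_products = chi_square_cross_products(10, 20, 9, 4)
print(round(by_discrepancy, 2), round(by_cross_products, 2))
```

Carried without rounding, the two routes give the same figure; small differences in hand work come from rounding the expected frequencies.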

TABLE 11.9.—SYMBOLIC ARRANGEMENT OF DATA IN A 2 × 2 CONTINGENCY TABLE
ILLUSTRATED BY THE PUBLIC-OPINION DATA

                  Variable II                       Socioeconomic group
              Higher   Lower    Both      Response   Higher   Lower   Both
Variable I      a        b      a + b     Yes          10       20     30
                c        d      c + d     No            9        4     13
Both          a + c    b + d      N       Both         19       24     43


Chi Square in Other than 2 × 2 Tables.—The use of chi square is by
no means limited to fourfold contingency tables. It can be applied with 
as few as two cells and with a much larger number. First, an example 
with only two frequencies to be tested. 

In a Two-cell Table.—For this purpose let us use the polling data on
preference for the radio. Combining the two socioeconomic groups
together as representing a more heterogeneous population, we may be
interested in knowing whether the population they represent is actually
in favor of radio newscasts. The sample is so small that there may be
some doubt. The frequencies are 30 in favor and 13 not. Could these
frequencies have arisen from a population in which opinion
is really evenly divided? The null hypothesis for this purpose is a 50-50
division. This is an arbitrarily chosen hypothesis; we could have chosen
some other, such as a 60-40 division of opinion.

With the 50-50 hypothesis chosen, the expected frequencies are 21.5
and 21.5, these being one-half of 43. The cell deviations or discrepancies
$(f_o - f_e)$ are 8.5, one positive and the other negative (without
correction for continuity). The squared discrepancy is 72.25. Dividing
this by $f_e$, which is the same in both cells, we get a squared contingency
of 3.360 for each cell. For the two combined we get 6.720, or a chi square




of 6.72. This is significant just beyond the 1 per cent level. With so 
small a sample, we should apply the correction for continuity, in which 
case the cell discrepancies are 8.0 and squared they are 64.0. The cell- 
square contingencies are 2.977 and chi square is 5.95. This figure leads 
us to a modification of our confidence in a real difference, though it does
not reverse our decision about rejection of the 50-50 hypothesis. 

With a two-cell table, when expected frequencies are equal, as in the 
last illustration, the formula for chi square reduces to the simple form 


$$\chi^2 = \frac{2(f_o - f_e)^2}{f_e}$$
(Chi square in a two-cell table when the expected frequencies are equal) (11.8)


Since, with one degree of freedom, t = χ, another formula for t, derived
from (11.8) and applying in the same special but not uncommon situation, is¹

$$t = \frac{f_1 - f_2}{\sqrt{f_1 + f_2}}$$
(t test of departure of two frequencies from equality) (11.9)

where $f_1$ = the larger of two frequencies and $f_1 + f_2 = N$.
Applied to the polling problem,

$$t = \frac{30 - 13}{\sqrt{43}} = 2.59$$

The square of this value is 6.71, which checks with the chi square obtained
above, without correction for continuity. Correction for continuity would
involve the use of the expression $(f_1 - f_2 - 1)$ in the numerator of formula
(11.9) in place of $(f_1 - f_2)$.
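The equivalence of formulas (11.8) and (11.9), with and without the correction for continuity, can be verified numerically. The Python sketch below is an illustration under the 50-50 hypothesis only; the function names and the `corrected` argument are assumptions introduced here.

```python
import math


def two_cell_chi_square(f1, f2, corrected=False):
    """Formula (11.8): chi square for two frequencies against an even division."""
    f_e = (f1 + f2) / 2
    discrepancy = abs(f1 - f_e)
    if corrected:
        discrepancy = max(discrepancy - 0.5, 0.0)  # .5 unit off each discrepancy
    return 2 * discrepancy ** 2 / f_e


def t_from_two_frequencies(f1, f2, corrected=False):
    """Formula (11.9): t for the departure of two frequencies from equality."""
    difference = abs(f1 - f2)
    if corrected:
        difference = max(difference - 1, 0)  # 1 unit off the difference
    return difference / math.sqrt(f1 + f2)


chi_square = two_cell_chi_square(30, 13)
chi_square_corrected = two_cell_chi_square(30, 13, corrected=True)
t = t_from_two_frequencies(30, 13)
t_corrected = t_from_two_frequencies(30, 13, corrected=True)
print(round(chi_square, 2), round(t, 2), round(chi_square_corrected, 2))
```

In both cases the square of t equals chi square, since reducing each discrepancy by .5 is the same as reducing the difference between the two frequencies by 1.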

Chi Square in Larger Tables of Frequencies.—To illustrate the
application of chi square to a larger table, this time with a table of six cells,
let us consider some more survey-of-opinion data.² This time the
question was whether the radio listener agreed with the opinions expressed
by a certain radio commentator, and the responses were tabulated as

¹ By a little algebra, it will be found that $(f_1 - f_2) = 2(f_o - f_e)$ and that
$f_e = (f_1 + f_2)/2$.
Equation (11.8) then becomes

$$\chi^2 = \frac{(f_1 - f_2)^2}{f_1 + f_2}$$

² Cantril, op. cit.




“Agree,” “Disagree,” or “Doubtful.” The survey was made in two 
cities and we have the numbers responding in each way in both of them. 
The results are listed in Table 11.10. 


TABLE 11.10.—A CHI-SQUARE SOLUTION IN A TWO-BY-THREE TABLE OF DATA ON
OPINIONS EXPRESSING AGREEMENT OR DISAGREEMENT WITH A CERTAIN RADIO
COMMENTATOR

Categories of response    Opinions in Syracuse    Opinions in Columbus    Both
Agree                              73                      22               95
Disagree                            9                       4               13
Doubtful                           41                      27               68
Totals                            123                      53              176

               Expected frequencies    Discrepancies        Squared discrepancies    Ratios
                      f_e               f_o − f_e             (f_o − f_e)²            (f_o − f_e)²/f_e
Categories     Syracuse   Columbus    Syracuse   Columbus    Syracuse   Columbus    Syracuse   Columbus
Agree            66.4       28.6        +6.6       −6.6        43.56      43.56       0.66       1.52
Disagree          9.1        3.9        −0.1       +0.1         0.01       0.01       0.00       0.00
Doubtful         47.5       20.5        −6.5       +6.5        42.25      42.25       0.89       2.06
Totals          123.0       53.0         0.0        0.0                               1.55       3.58


The derivation of the expected frequencies was carried out with the
application of formula (11.3). From here on, the work as recorded in
Table 11.10 is just as we have done previously. The sum of the square
contingencies is 5.13. The degrees of freedom (according to formula 11.5)
are 2 × 1 = 2. For 2 degrees of freedom the tables of chi square show
that it requires a chi square of 5.991 to be significant at the 5 per cent
level. Our chi square falls below this level, so there is no really
convincing reason to doubt that the two populations sampled are alike on
the question at issue, though there are less than 10 chances in 100 that a
chi square as large or larger could have arisen by chance.
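The same routine extends to a table of any number of rows and columns. The Python sketch below is illustrative only; the function name `chi_square_table` is an assumption made here, and the degrees of freedom are taken as (rows − 1)(columns − 1). Working with unrounded expected frequencies it gives about 5.16 for the radio-commentator data, slightly above the 5.13 obtained by hand from rounded values, but still below 5.991.

```python
def chi_square_table(observed):
    """Chi square and degrees of freedom for a table of obtained frequencies.

    Expected frequencies are (row total x column total) / N for each cell.
    """
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    chi_square = 0.0
    for i, row in enumerate(observed):
        for j, f_o in enumerate(row):
            f_e = row_totals[i] * col_totals[j] / n
            chi_square += (f_o - f_e) ** 2 / f_e
    degrees_of_freedom = (len(observed) - 1) * (len(observed[0]) - 1)
    return chi_square, degrees_of_freedom


# Rows: Agree, Disagree, Doubtful; columns: Syracuse, Columbus.
opinions = [[73, 22], [9, 4], [41, 27]]
chi_square, degrees_of_freedom = chi_square_table(opinions)
print(round(chi_square, 2), degrees_of_freedom)
```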

The small expected frequencies in Table 11.10 should raise some question 
concerning the need for corrections. 

Combining Columns or Rows.—As a matter of fact, we have one expected
frequency lower than 5. If we decide that it is too risky to solve the
problem with so small an $f_e$, there is one thing we could do. Incidentally,
it happens in this particular problem that the squared discrepancy
$(f_o - f_e)^2$ was practically zero for the cell in which $f_e$ was smallest, so that
this cell makes no contribution to chi square. It is a situation in which a
very small $f_e$ is combined with a relatively large squared discrepancy that is
serious, for then the cell’s contribution to chi square is unduly large and yet
of doubtful stability or meaningfulness.

If we had combined the “Disagrees” with the “Doubtfuls” in this 
problem, we should have had observed frequencies of 50 for Syracuse and 
31 for Columbus, with expected frequencies of 56.6 and 24.4, respectively. 
We can combine both observed and expected frequencies after the latter 
have been computed in uncombined form. After this kind of a combina- 
tion is made, the size of chi square is likely to be smaller than before, though 
not always. Even though it is smaller, the number of degrees of freedom 
is also reduced and the significance limits are accordingly smaller, so that 
the chances of a significant departure of data from the null distribution are 
presumably about the same as they were. 
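The combination of the “Disagree” and “Doubtful” rows can be sketched in Python as follows. This is an illustration, not the author's procedure in code form; the helper names `expected_frequencies` and `combine_rows` are assumptions. As stated above, the expected frequencies are first computed for the uncombined table and then added together in the same way as the observed frequencies.

```python
def expected_frequencies(observed):
    """Expected frequency of each cell: (row total x column total) / N."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]


def combine_rows(table, first, second):
    """Return a new table in which rows `first` and `second` are added together."""
    merged = [x + y for x, y in zip(table[first], table[second])]
    kept = [row for k, row in enumerate(table) if k not in (first, second)]
    return kept + [merged]


observed = [[73, 22], [9, 4], [41, 27]]  # Agree, Disagree, Doubtful
expected = expected_frequencies(observed)

observed_combined = combine_rows(observed, 1, 2)
expected_combined = combine_rows(expected, 1, 2)

chi_square = sum(
    (f_o - f_e) ** 2 / f_e
    for obs_row, exp_row in zip(observed_combined, expected_combined)
    for f_o, f_e in zip(obs_row, exp_row)
)
degrees_of_freedom = (len(observed_combined) - 1) * (len(observed_combined[0]) - 1)
print(observed_combined[1], round(chi_square, 2), degrees_of_freedom)
```

The combined row reproduces the observed frequencies of 50 and 31 and expected frequencies of about 56.6 and 24.4, with 1 degree of freedom remaining.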

Chi Square in Testing the Hypothesis of Normal Distribution.—One
convenient use of chi square is in testing whether or not a set of observed
frequencies in a frequency distribution could probably have arisen from a
normally distributed population. The procedure is carried out in much
the same manner as with frequencies in this chapter. Expected frequencies
are estimated as was illustrated in Ch. 7, particularly in Table 7.1.
The discrepancies between observed and expected frequencies are squared,
divided by the $f_e$’s, and these ratios are summed to give $\chi^2$. The number
of degrees of freedom is the number of class intervals less three. One
degree of freedom is lost in computing the mean, one in computing the
standard deviation, and one for N, the size of sample. All three of these
statistics are used in deriving the expected frequencies, since the three
of them describe the particular normal curve that comes closest to the data.

At the tails, where $f_e$’s are small (less than 5), two or more class intervals
should be grouped together as was suggested in recent paragraphs for
other data. The interpretation of the result is made according to precedents
already repeated, and the hypothesis of normality is accepted or
rejected according as $\chi^2$ is small or large.

Using the data of Table 7.1, with the expected and obtained frequencies 
already given, we will make the chi-square test. First, to get rid of very 
small tail frequencies we will combine the four classes at the upper end of 
the distribution and the three at the lower end. The results of this are 
shown in Table 11.11. The next steps are carried out, as shown, with a
resulting chi square equal to 3.68. With 5 degrees of freedom, a chi square
of 11.07 is required for significance at the 5 per cent level. From the
chi-square table, we find by interpolation that about 60 per cent of the chi




TABLE 11.11.—A CHI-SQUARE TEST OF THE NORMAL-DISTRIBUTION HYPOTHESIS APPLIED
TO A FREQUENCY DISTRIBUTION OF SCORES

(1)         (2)                (3)                (4)              (5)               (6)
            Original           Grouped                             Squared
Scores      frequencies        frequencies        Discrepancies    discrepancies
            f_o      f_e       f_o      f_e       f_o − f_e        (f_o − f_e)²      (f_o − f_e)²/f_e

44-46        0       0.2
41-43        1       0.8
38-40        4       2.2       10       8.2         +1.8             3.24              0.395
35-37        5       5.0
32-34        8       9.0        8       9.0         −1.0             1.00              0.111
29-31       14      13.3       14      13.3         +0.7             0.49              0.037
26-28       17      15.8       17      15.8         +1.2             1.44              0.091
23-25        9      15.1        9      15.1         −6.1            37.21              2.464
20-22       13      11.7       13      11.7         +1.3             1.69              0.144
17-19        8       7.2        8       7.2         +0.8             0.64              0.089
14-16        3       3.6
11-13        4       1.5        7       5.6         +1.4             1.96              0.350
 8-10        0       0.5
Sums        86      85.9       86      85.9         +0.1                          χ² = 3.681

The four highest classes and the three lowest classes are combined; the
grouped values for each combination are entered once.


squares from data such as these could be as large as 3.68 or larger. We 
conclude that the obtained frequency distribution fits the normal form 
quite acceptably. We can well tolerate the idea that the population 
from which it came is normally distributed on the measurement scale. In 
making this chi-square test, it is important that $\sum f_e$ shall approach N
very closely.
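The grouping of tail classes and the resulting chi square can be sketched in Python as below. This is an illustration only. The pooling rule used (merge classes inward from each end until the pooled expected frequency reaches 5) is an assumption introduced here; it happens to reproduce the grouping of Table 11.11, and the function names are likewise hypothetical. The frequencies are those of Table 11.11, listed from the highest class to the lowest.

```python
def pool_leading_classes(observed, expected, minimum=5.0):
    """Merge classes from the start of the lists until pooled f_e >= minimum."""
    pooled_o = pooled_e = 0.0
    k = 0
    while k < len(expected) and pooled_e < minimum:
        pooled_o += observed[k]
        pooled_e += expected[k]
        k += 1
    return [pooled_o] + observed[k:], [pooled_e] + expected[k:]


def pool_both_tails(observed, expected, minimum=5.0):
    """Pool the upper tail, then the lower tail, of a frequency distribution."""
    o, e = pool_leading_classes(list(observed), list(expected), minimum)
    o, e = pool_leading_classes(o[::-1], e[::-1], minimum)
    return o[::-1], e[::-1]


# Original obtained and expected frequencies, classes 44-46 down to 8-10.
f_o = [0, 1, 4, 5, 8, 14, 17, 9, 13, 8, 3, 4, 0]
f_e = [0.2, 0.8, 2.2, 5.0, 9.0, 13.3, 15.8, 15.1, 11.7, 7.2, 3.6, 1.5, 0.5]

grouped_o, grouped_e = pool_both_tails(f_o, f_e)
chi_square = sum((o - e) ** 2 / e for o, e in zip(grouped_o, grouped_e))
# Three degrees of freedom are lost: for the mean, the standard deviation, and N.
degrees_of_freedom = len(grouped_o) - 3
print(len(grouped_o), round(chi_square, 2), degrees_of_freedom)
```

Eight classes remain after grouping, giving 5 degrees of freedom and a chi square of about 3.68.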


Exercises 


1. Suppose that we ask an observer to arrange a series of weights in rank order from
lightest to heaviest, the differences being very small. If he places them in perfect rank
order, what is the probability that he could have done so by sheer guessing? No
matter how many weights ranked, there is only one correct way of doing this. The
total number of ways the observer could have arranged each number of weights is
given below:

Number of weights . . . . .     5       6       7
Number of orders  . . . . .   120     720   5,040

Which perfect orders would be regarded as “not significant,” “significant,” and “very
significant”? State the probabilities of perfect orders by chance.

2. An observer knows that he will hear one of three similar speech sounds. He is 
given the three in chance order in a total of 30 trials. How many correct judgments 
must he give before we regard his success as significant and as very significant? 

3. Suppose that the observer in Exercise 2 were given 48 trials. How large a score is 
significant, and how large a score is very significant? 

4. A certain examination includes 40 items, each item with four alternative responses. 
How large a score must a student earn before you feel that he probably knows something 
about the content of the examination? Before you feel that he undoubtedly knows 
something about it? Would you feel absolutely sure that he knows something about 
the content if he made a score of 35? Discuss. 

5. In a test of five-response items, how many items would you need to include in 
order to feel sure that a score of 30 per cent right indicates knowledge of the content? 
How large must the test be if a score of 25 per cent right is to indicate knowledge 
beyond a reasonable doubt? Tell how you have interpreted “sure” and “beyond 
reasonable doubt.”


Data 11A.—NUMBER OF PERSONS IN TWO GROUPS, DEPRESSED AND NOT DEPRESSED
IN TEMPERAMENT, WHO RESPONDED IN EACH OF THREE CATEGORIES TO
THE QUESTION, “WOULD YOU RATE YOURSELF AS AN IMPULSIVE
INDIVIDUAL?”


Group 


Depressed.........2..05 72 45 
Not depressed. . 


6. Is there a significant difference in Data 11A between the numbers of “Yes”
responses? Present statistical proof. 

7. Is there a significant difference between the two groups in the number of
responses? Explain. 

8. Is there a significant difference in the two groups with regard to all three response 
categories taken together? Determine this by computing chi square. 

9. State a number of null hypotheses that might be applied to Data 11B. 




Data 11B.—NUMBERS OF TWO GROUPS DIFFERING IN ABILITY WHO PASSED A CERTAIN
TEST ITEM




10. Do both groups together in Data 11B show a significant deviation from a chance 
situation of passing and failing? Explain. 

11. Is there a significant difference between the high and low group in terms of the 
numbers passing the item? Explain. Can you predict from this result whether there 
would be a significant difference between numbers of failures in the two groups? 
Explain.

12. Find a chi square for Data 11B in as many ways as you know how. Interpret 
your results. 

13. In Data 11A, combine the “Yes” and “?” responses, and compute chi square
for the fourfold table. Compare your results with those in Exercise 8. 


CHAPTER 12 
TEST SCALES AND NORMS 


In this chapter we take a short departure from fundamental statistics but 
we apply statistical ideas to the problems of measurement by means of 
tests. Heretofore, we have accepted raw test scores as if they were meas- 
ures of psychological or educational variables on scales of equal units. 
Strictly speaking, this assumption is necessary before we are justified in 
applying many of the statistical procedures, including arithmetic mean, 
standard deviation, and correlation coefficient. When a test is composed 
of many items, when it covers a relatively wide range of values, and when it 
is of an appropriate level of difficulty for the population examined, this 
assumption is fairly sound. Now, however, we must examine the question 
of measuring scales, not so much for their suitability in meeting the assump- 
tion of equal units but for the very practical reason of comparability and 
meaningfulness of scales.

Why Common Scales Are Necessary.—The chief reasons for dissatisfaction
with most raw-score scales are their lack of meaning and their lack
of comparability. Aside from a few tests that yield scores in terms of
physical stimulus values (such as tests of sensory acuity) or of response
values (such as time, distance, or energy values), most tests yield numerical
values that have no particular significance. There was a time (unfortunately
it still is not entirely in the past) when scores were given in terms of
percentages. The tradition of grading examinations in terms of percentage
of right answers still has popular appeal, in spite of the many experimental
demonstrations that such percentages are neither accurate nor
meaningful. The method gave a feeling (definitely fallacious) of having
some kind of an “absolute” measure of the individual. It is difficult for
even the better informed student to free himself from this traditional thinking,
even when he has given up the operations it implies.

If modern psychology and education have taught anything about measurement,
they have amply demonstrated the fact that there are few, if any,
absolute measures of human behavior. The emphasis has shifted from the
search for absolute measures to an emphasis upon the concept of individual
differences. The mean of the population has become the reference point;
and out of the differences between individuals has come the basis for
scale units. Even when the test happens to yield such objective scores
as those in time, or space, or energy units, it is sometimes doubted that such
units, though undoubtedly equal from a physical point of view, really 
represent equal psychological increments along scales of ability or talent. 
These considerations, among others, send us in search of more rational 
and meaningful scales of measurement for behavior events. 

In addition to the more theoretical demands just mentioned, there is the 
very practical consideration that scales for different tests should be com- 
parable. The most obvious need for comparable scales is seen in educa- 
tional and vocational guidance, particularly when profiles of scores are 
utilized. A profile is intended to give a picture of an individual. We 
would hardly bother to prepare one for an individual if we did not expect 
to make very direct comparisons of the person’s levels in different traits. 
The comparisons of trait positions for the same individual would be mis- 
leading, if not worthless, if there were not at least reasonable comparability 
of levels for different scores going under the same numerical value. No 
informed person would think of using raw scores as a basis of making 
direct comparisons among an individual’s positions with respect to trait 
variables. Conversion of raw scores to values on some other, common, 
scale is essential. The use of centile rank positions was mentioned in an 
earlier chapter (Ch. 6). Centile values are suitable to the extent that they 
do make possible comparable values for different tests, they do use the 
mean (or median) as the main reference point, and they are easily under- 
stood by the layman. They serve their best purpose when measurements 
must be interpreted to the layman. But, for reasons which were stated 
earlier (Ch. 6), centile values have limitations which make them fall short 
of full usefulness to those who expect something more of measurements. 
Centiles, after all, are rank positions and do not represent equal units of 
individual differences. It is possible to have scales that probably provide 
units of equal size as well as comparability of means, dispersions, and 
form of distribution. 

Some Common Derived Scales.—The chief interest in what follows will
be in such scales—those which achieve comparability of means, dispersions,
and form of distribution. We will not go into the very popular mental-age
concept or the IQ scale. As simple as those ideas may be, the achievement
of a battery of tests which will meet the requirements of age equivalents
and appropriate distributions of IQ involves statistical problems of an
intricate nature which we cannot go into. Treatment of these problems
may be found in references to McNemar and to Marks.¹ The three kinds


¹ McNemar, Q. The revision of the Stanford-Binet scale. Boston: Houghton
Mifflin, 1942; Marks, E. S. Sampling in the revision of the Stanford-Binet scale,
Psychol. Bull., 1947, 44, 413-434.




of scales to be discussed here are the standard-score scale, the T scale, and 
the C scale. Their application to derivation of test norms and profile 
charts will be given attention. The treatment will be kept at a rather 
elementary level, emphasizing basic concepts. For a more advanced 
treatment of some of these problems the reader is referred to a monograph 
prepared by Flanagan.¹


STANDARD SCORES 


An Example of the Need for Comparable Scores.—A concrete example 
will illustrate some of the ideas expressed above. A student earns scores 
of 195 in an English examination, 20 in a reading test, 39 in an information 
test, 139 in a general scholastic-aptitude test, and 41 in a nonverbal 
psychological test. Is he therefore best in English and poorest in reading? 
Could he perhaps be equally good in all the tests? From the raw scores 
alone, we can answer neither of these questions nor many others that could 
be legitimately asked. This student’s (student I) five scores just cited 
will be seen listed in column (4) of Table 12.1. Knowing the means of 


TABLE 12.1.—A COMPARISON OF STANDARD SCORES WITH RAW SCORES EARNED BY
TWO STUDENTS IN FIVE EXAMINATIONS

(1)                    (2)      (3)          (4)            (5)               (6)               (7)
                                Standard     Raw            Deviations        Standard          Deviations in
Examination            Mean     deviation    scores X       x = X − M         scores z          standard scores*
                                             I      II      I        II       I        II       I       II
English                155.7    26.4         195    162     +39.3    +6.3     +1.49    +0.24    0.98    0.66
Reading                 33.7     8.2          20     54     −13.7   +20.3     −1.67    +2.48    2.18    1.58
Information             54.5     9.3          39     72     −15.5   +17.5     −1.67    +1.88    2.18    0.98
Scholastic aptitude     87.1    25.8         139     84     +51.9    −3.1     +2.01    −0.12    1.50    1.02
Psychological           24.8     6.8          41     25     +16.2    +0.2     +2.38    +0.03    1.87    0.87
Sums                                         434    397                       +2.54    +4.51    8.71    5.11
Means                                                                         +0.51    +0.90    1.74    1.02

* Disregarding algebraic signs.


students in the five tests helps some, since they serve as norms or com- 
parable zero points. The means are listed in column (2). We now see 
that the student is well above average in English and in scholastic aptitude 
and is somewhat below average in reading and information, just as the 


¹ Flanagan, J. C. Scaled scores. New York: Cooperative Test Service, 1939.








numbers seem to indicate at their face value. The second student, whose 
raw scores are also in column (4), is numerically highest in the same two 
and lowest in the same three. When we consider the averages again,
however, we find that student II is only about average in English, in 
scholastic aptitude, and in the psychological test, but he is above average 
in reading and in the information test. 

When a student is above the mean in two tests, in which one is he 
actually superior? Student I is 39.3 points above the mean in English 
and 16.2 points above the mean in the psychological test [see column
(5) of Table 12.1]. Is his superiority in English really greater than his 
superiority in the psychological test? Student II is 20.3 points above 
the mean in reading and 17.5 points above the mean in information. 
Is he about equally superior in the two tests? 

And how do the two students compare? The superiority of student 
I is apparent in three tests (English, scholastic aptitude, and psycho- 
logical) and that of student II, in the other two tests. This we can 
tell from the raw scores. But suppose the two were competing for a 
scholarship at a university; which one, if there is to be a choice between 
the two, should win? The totals of the five scores are 434 and 397, 
in favor of student I. Granting that the five different abilities are 
equally important, have we done justice by comparing sums of raw 
scores? Should we be justified in finding an average of each student’s 
five raw scores? 

Suppose that we were interested in determining which student is 
the more consistent in his abilities, as shown by these five tests, and
which one has the greater variability within himself. Would a comparison 
of the average deviations or standard deviations of the five raw scores give 
us the answer? As the reader has probably guessed, the reply to most of 
these questions is in the negative. We are extremely limited in making 
direct comparisons in terms of raw scores for the reason that raw-score 
scales are arbitrary and unique. We need a common scale before such 
comparisons as we have called for can be made. Standard scores furnish
one such common scale.

The Nature of a Standard-score Scale.—A standard-score scale is one
that has a mean of zero and a standard deviation of 1.0. The unit of the
scale might be taken as 1σ, or as 0.1σ, or any other arbitrary fraction of the
standard deviation. An illustration of the conversion of a raw-score scale
into a standard scale is shown in Fig. 12.1, A, B, and C. Distribution A
is based upon the original or raw scores. The mean is 80 and standard
deviation is 14.0. The distribution is obviously somewhat negatively
skewed.




As we have previously seen, a standard score z is derived from a raw score
by means of the formula

$$z = \frac{X - M}{\sigma} = \frac{x}{\sigma}$$
(Standard score z corresponding to a raw score X and to a deviation x) (12.1)

An intermediate step between the raw-score scale and the standard-score
scale is the deviation X − M, or x. This step is illustrated in Fig. 12.1 B.


[Fig. 12.1. Panels: A, original score scale (X), mean = 80, σ = 14.0; B, deviation-score scale (x), mean = 0, σ = 14.0; C, standard-score scale (z), mean = 0, σ = 1.0; a standard scale (not normalized), mean = 50, σ = 10.0; T-score scale (normalized), mean = 50, σ = 10.0.]

FIG. 12.1.—Distributions before and after conversion from a raw-score scale to a standard-score
scale with a desired mean and standard deviation, with and without normalizing
the distribution.


Deducting the mean from every raw score has the effect of shoving the
entire distribution down the same scale so that the mean is zero. The
final step, arriving at the z scale, is shown in Fig. 12.1 C. Distribution C
is drawn so that the mean is directly beneath that in distribution B, both
at zero, and so that deviations of 14 units on the original scale correspond
with deviation of 1σ on the standard scale. Especially to be noted is the
fact that the form of distribution has not changed; it is still skewed exactly
as it was originally. This procedure does not normalize the distribution as
some other scaling procedures do.
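The conversion to standard scores, and the fact that it leaves the form of the distribution unchanged, can be illustrated with a short Python sketch. The raw scores below are hypothetical values chosen to be negatively skewed, and the function names are assumptions made here. The standard deviation is computed with N in the denominator (`statistics.pstdev`), which is also an assumption about the intended computation.

```python
import statistics


def standard_scores(raw_scores):
    """Convert raw scores X to standard scores z = (X - M) / sigma."""
    mean = statistics.mean(raw_scores)
    sigma = statistics.pstdev(raw_scores)  # standard deviation with N in the denominator
    return [(x - mean) / sigma for x in raw_scores]


def skewness(scores):
    """Mean of the cubed standard scores, a simple index of skewness."""
    z = standard_scores(scores)
    return sum(value ** 3 for value in z) / len(z)


# Hypothetical raw scores with a long lower tail (negative skew).
raw = [52, 61, 70, 78, 83, 85, 88, 90, 92, 95]
z = standard_scores(raw)

print(round(statistics.mean(z), 6), round(statistics.pstdev(z), 6))
print(round(skewness(raw), 3), round(skewness(z), 3))
```

The z scores have a mean of zero and a standard deviation of 1.0, keep the same rank order as the raw scores, and show the same negative skewness index as before conversion.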

Application to Comparisons of Scores.—The two students represented in
Table 12.1 will now be compared in terms of their standard scores. Before




we take these comparisons very seriously, however, we must consider two 
possible limitations to this procedure. Applying formula (12.1), we arrive 
at the standard scores in column (6) of Table 12.1. For accurate compari- 
sons between different tests, there are two necessary conditions to be 
satisfied. The population of students from which the distributions of 
scores arose must be assumed to have equal means and dispersions in all 
the abilities measured by the different tests and the form of distribution, 
in terms of skewness and kurtosis, must be very similar from one ability to 
another. Unfortunately, we have no ideal scales common to all these 
tests, measurements which would tell us about these population param- 
eters. Certain selective features might have brought about a higher mean, 
a narrower dispersion, and a negatively skewed distribution on the actual 
continuum of ability measured by one test, and a lower mean, a wider 
dispersion, and a symmetrical distribution on the continuum of another 
ability represented by another test. Since we can never know definitely 
about these features for any given population, if we want to achieve com- 
munality of scales at all (standard or any other), we often have to proceed 
on the assumption that actual means, standard deviations, and form of 
distribution are uniform for all abilities measured. In spite of these 
limitations, it is almost certain that derived scales, such as the standard- 
score scale, provide us with more nearly comparable values than do raw- 
score scales. The recognition of these limitations, however, should be 
admitted and interpretations based upon the use of standard scores should 
be made with appropriate reservations in line with those limitations. 

Returning to Table 12.1, with the standard scores we have for the two 
students, we can now give more satisfactory answers to the questions 
raised above about these students. Student I is most superior in the 
psychological test, next in scholastic aptitude, and third in English. Had 
we judged this by his deviations from the mean, we should have decided 
that his order of superiority was scholastic aptitude first, English second, 
and psychological third. We find that in terms of standard scores he is 
equally deficient in reading ability and information, whereas the deviations 
would have placed him lower in information than in reading. Student II’s
five standard scores come in about the same rank order as do his deviation
scores but certainly not in the same order as his raw scores.

When comparing the two students in terms of raw scores, we should 
conclude that student I has the greatest advantage in number of points 
in scholastic aptitude; in terms of deviations, this would be the same, 
but in terms of standard scores it is in the psychological test that the 
advantage is greatest. Student II has about the same superiority over 
student I in the reading and information tests in terms of raw scores 




and deviations but has decidedly greater superiority in reading ability 
in terms of standard scores. When we compare the two students as 
to total or average score, whereas the raw-score total gives student I 
the distinct advantage of 37 points, or an average superiority of about 
7 points, the standard-score averages reverse the order and give student 
II a 0.39z lead. In a scholarship contest, we should conclude that
student II has the greater all-round ability as indicated by these tests, 
when students are compared on a standard-score basis. 

A Measure of Individual Intravariability.—Studies of variability within 
persons (intravariability) have often resorted to the use of standard scores. 
In terms of them, is student I more or less variable than student II? 
Here the average deviation is a suitable mode of comparison. In column 
(7) of Table 12.1 is given the absolute deviation of each standard score 
from the student’s own mean score. The average deviations of these two
students in the five tests are 1.74 and 1.02, respectively. In other words, 
student I is about 70 per cent more variable than student II. Although 
this is the usual procedure for determining intravariability, a word of 
caution is important. In using this procedure, we are assuming that in all 
the abilities measured the true variability of the group measured is the 
same. The standard-score scale makes the distributions all alike, with 
standard deviations equal to 1.0. Should we happen to have sampled a
group that is actually more variable in one ability than in another, we 
do not really have comparable units of measurement in all tests. This
procedure also assumes tests of approximately equal reliability.
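The intravariability comparison described above can be sketched in a few lines. This is a minimal illustration only: the standard scores below are hypothetical, chosen for easy arithmetic, and are not the Table 12.1 values; the function name `average_deviation` is likewise an assumption.

```python
from statistics import fmean


def average_deviation(standard_scores):
    """Mean absolute deviation of one person's standard scores from his own mean."""
    own_mean = fmean(standard_scores)
    return fmean(abs(z - own_mean) for z in standard_scores)


# Hypothetical standard scores on five tests (NOT the Table 12.1 data).
student_1 = [2.0, 1.0, 0.0, -1.0, -2.0]
student_2 = [1.0, 0.5, 0.0, -0.5, -1.0]

ad_1 = average_deviation(student_1)
ad_2 = average_deviation(student_2)

# Ratio of the two average deviations: how much more variable student 1 is.
ratio = ad_1 / ad_2
print(ad_1, ad_2, ratio)
```

With the text's own figures of 1.74 and 1.02, the same ratio is about 1.7, which is the "about 70 per cent more variable" reported for student I. The caution in the text still applies: the comparison presumes equal true variability and roughly equal reliability across the tests.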
Disadvantages of Standard Scores.—Although standard scores will
do for us all that we have said and more, under the proper conditions,
there are several things about them which make them less convenient
than some others. One shortcoming is the fact that half the scores will
be negative in sign, which makes things awkward in computation. Another
disadvantage is the very large unit, which is one standard deviation.

We could, of course, overcome the first shortcoming by adding a con- 
stant to all the scores to make them all positive, and we could multiply 
them by another constant, preferably by 10, to make the unit smaller and 
the range in total units greater. If we did both of these we could achieve
almost any mean and standard deviation we wanted depending upon the
choice of constants. If we wanted a mean of 50 and a standard deviation
of 10, we would multiply every standard score by 10 and add 50.

Direct Scaling to a Desired Mean and Standard Deviation.—This
brings us to a more general procedure. If we knew from the time that we
had acquired the distribution of raw scores that we were to convert them to
a common scale with a certain mean and standard deviation, we would not
go to the trouble of converting first to standard scores then to the new 
scale. We can do the operation in one step by the equation¹


$$X_s = \left(\frac{\sigma_s}{\sigma_o}\right) X_o - \left[\left(\frac{\sigma_s}{\sigma_o}\right) M_o - M_s\right] \qquad (12.2)$$

(Conversion of scores in one scale directly to comparable scores in another scale)

where $X_s$ = a score on the standard scale, corresponding to $X_o$.
$X_o$ = a score on the obtained scale; a raw score.
$M_o$ and $M_s$ = means of $X_o$ and $X_s$, respectively.
$\sigma_o$ and $\sigma_s$ = standard deviations of $X_o$ and $X_s$, respectively.
If the desired mean is 50 and the desired standard deviation is 10, with
these substitutions the equation becomes

$$X_s = \left(\frac{10}{\sigma_o}\right) X_o - \left[\left(\frac{10}{\sigma_o}\right) M_o - 50\right]$$


Knowing $\sigma_o$ and $M_o$ from the particular distribution of raw scores, the
equation reduces to very simple form describing a straight line. Taking
the illustration of Fig. 12.1, where $M_o = 80$ and $\sigma_o = 14.0$,

$$X_s = \left(\frac{10}{14.0}\right) X_o - \left[\left(\frac{10}{14.0}\right) 80 - 50\right] = .714 X_o - 7.12$$


A raw score of 100 would, by this formula, become a scaled score of 64. A 
raw score of 50 would become a scaled score of 29. We can see a graphic 
exhibition of this transformation by relating distributions A and D in Fig. 
12.1. A score of 100 in A is in a position comparable to a score of 64 in D, 
and a score of 50 in A is in position similar to 29 in D.²
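As a check on the arithmetic, formula (12.2) can be written as a small function. This is a sketch only; the name `make_linear_scaler` and its parameter names are assumptions introduced for illustration, while the values 80, 14.0, 50, and 10 are those of the worked example above.

```python
def make_linear_scaler(raw_mean, raw_sd, target_mean=50.0, target_sd=10.0):
    """Return a function applying formula (12.2):
    X_s = (sd_s / sd_o) * X_o - [(sd_s / sd_o) * M_o - M_s].
    """
    slope = target_sd / raw_sd
    intercept = target_mean - slope * raw_mean  # equals -[(sd_s / sd_o) * M_o - M_s]

    def scale(raw_score):
        return slope * raw_score + intercept

    return scale


# Worked example from the text: M_o = 80, sigma_o = 14.0, desired mean 50, desired SD 10.
to_scaled = make_linear_scaler(raw_mean=80.0, raw_sd=14.0)

scaled_100 = round(to_scaled(100))
scaled_50 = round(to_scaled(50))
print(scaled_100, scaled_50)
```

Rounded to whole numbers, the raw scores of 100 and 50 come out as the scaled scores of 64 and 29 given in the text. Because the transformation is a straight line, it changes the mean and unit but leaves the form of the distribution as it was.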

Scaling by this procedure, as by the standard-score method, assumes that 
the obtained form of distribution is the same as the population distribution. 
If this is true, then it is probable that units on the derived scale are equal, 
also those on the raw-score scale. So far as improving the equality of units 
is concerned, then, nothing has been gained, nor was anything to be gained. 
We know, however, that the form of distribution of a sample is not neces- 
sarily the form of distribution of the population. The discrepancy need 
not be, and probably is not, due to sampling errors, particularly if the 
sample is large. There are many reasons for radical departures of sample 
distributions from genuine population distributions of the trait measured: 


1 For the derivation of this type of equation, see Appendix A. 


2 A more general discussion of such a transformation procedure will be found in Ch. 19. Formula (12.2) here is the same as formula (19.8) there, with a change in symbols.


difficulty level of the test, intercorrelation of the items (see Ch. 17), and the 
variations in difficulty and intercorrelation. We should not, therefore, 
feel too obligated to retain the same form of distribution in scaled scores 
as in the raw scores. If there is a real discrepancy between population 
distribution and sample distribution, there is much room for improvement 
of the scale in terms of equality of units. The next methods to be described
have the probable advantage that by normalizing distributions they also 
achieve better metric scales. 


THE T SCALE AND T SCALING OF TESTS


The well-known T scale overcomes the objections raised against
standard scores and adds besides an advantage peculiar to itself. It
adopts as its unit one-tenth of a standard deviation, so that an ordinary
distribution with a range of 5 to 6σ on its base line yields 50 to 60 integral
T-scale scores. In addition, the T scale goes beyond any ordinary
distribution, extending its scale over a spread of 10 standard deviations, or
100 units in all. Any age or grade group would yield its own distribution
extending 5 to 6σ. A group just higher in ability would overlap this one
and yet would need an extension over new units beyond the limit of the
first group. A third group of lower age would need an extension of the
measuring stick at the other end. When all groups from lowest to highest
are taken into account, considerable extension is required. The result,
with these extensions, is a single common scale on which all groups, over a
wide range, have a common unit and a common zero point. It has been
found in practice that a scale with 100 units (or 10σ) will be extensive
enough. It is based upon a normal curve whose tails extend from −5σ to
+5σ (see Fig. 12.2). Besides making the unit equal to 0.1σ, the T scale
also has the zero point at the extreme left, which places it at −5σ. The
mean now becomes 50, and the other T-scale points are spaced as in Fig.
12.2.

Fig. 12.2.—The T scale and its relation to the standard-score scale extending over a range of 10 sigma units. [Standard scores from −5σ to +5σ are aligned with T-scale scores from 0 to 100.]


McCall, who originated the T scale, suggested that the mean of this 
curve should be that of a representative twelve-year-old group. This 
mean was chosen because the twelve-year-olds are about midway along 
the scale of development. Since any limited sample of them would range 
over not more than about 60 units of the T scale, groups of higher and lower 
ability were required to complete the picture and to determine what kind of 
performance comes at 80 to 100 points at the upper end of the scale and 
0 to 20 at the lower end. The method of finding T-scale equivalents for
performances beyond the ranges of tested samples will not be described
here. Suffice it to say that many test makers take pains to set up means
of converting raw scores on all tests into T-score equivalents. The T-scale
principle can be used with any standard group of individuals, whether they
are twelve-year-olds or not. The procedure for converting raw scores
in any test into T-scale equivalents (though not with the twelve-year-old
mean and unit) will now be described.

How to Derive T-scale Equivalents for Raw Scores.—A college or 
university or a single school system may wish to use the T-scale idea 
as its common yardstick for all its tests. The freshmen entering a large 
university, for example, may be taken as the standard group for this 
purpose. As an illustration, let us use the data in Table 12.2. Here is a
distribution of 83 scores obtained by freshmen in an English examination
of the objectively scored type. The procedure will be described step by
step:


Step 1. List the class intervals as usual. Here a maximum number of 
class intervals is best; 20 or even more. 

Step 2. List the exact upper limits of class intervals. 

Step 3. List the frequencies. 

Step 4. List the cumulative frequencies (see Ch. 6 for instructions). 

Step 5. Find the cumulative proportions for the class intervals. 

Step 6. Find the corresponding T scores from Table 12.3. These are then 
listed in the last column of Table 12.2, given to one decimal place. 
We usually want finally a ready means of reading directly the T 
score corresponding to any integral raw score. It is recommended
that the remaining steps be taken to satisfy this objective. 

Step 7. Plot a series of points to represent each T score in Table 12.2
corresponding to the upper limit of the class interval, as in Fig. 12.3.
If the original distribution of raw scores is normal, the points
should fall rather close to a straight line. The reason that they
are not perfectly in line is that there are some irregularities in the
data. Draw through the points with a straight edge a line


TABLE 12.2.—THE CALCULATION OF T SCORES FOR A DISTRIBUTION OF ENGLISH-EXAMINATION SCORES

| (1) Class interval | (2) Upper limit of interval | (3) Frequency | (4) Cumulative frequency | (5) Cumulative proportion | (6) T score (from Table 12.3) |
| 225-229 | 229.5 |  1 | 83 | 1.000 |      |
| 220-224 | 224.5 |  0 | 82 |  .988 | 72.6 |
| 215-219 | 219.5 |  1 | 82 |  .988 | 72.6 |
| 210-214 | 214.5 |  5 | 81 |  .976 | 69.8 |
| 205-209 | 209.5 |  5 | 76 |  .916 | 63.8 |
| 200-204 | 204.5 |  7 | 71 |  .855 | 60.6 |
| 195-199 | 199.5 |  6 | 64 |  .771 | 57.4 |
| 190-194 | 194.5 |  6 | 58 |  .700 | 55.2 |
| 185-189 | 189.5 |  6 | 52 |  .627 | 53.2 |
| 180-184 | 184.5 | 11 | 46 |  .554 | 51.4 |
| 175-179 | 179.5 |  9 | 35 |  .422 | 48.0 |
| 170-174 | 174.5 |  5 | 26 |  .313 | 45.1 |
| 165-169 | 169.5 |  5 | 21 |  .253 | 43.3 |
| 160-164 | 164.5 |  6 | 16 |  .193 | 41.3 |
| 155-159 | 159.5 |  5 | 10 |  .120 | 38.2 |
| 150-154 | 154.5 |  2 |  5 |  .060 | 34.5 |
| 145-149 | 149.5 |  1 |  3 |  .036 | 32.0 |
| 140-144 | 144.5 |  1 |  2 |  .024 | 30.2 |
| 135-139 | 139.5 |  0 |  1 |  .012 | 27.4 |
| 130-134 | 134.5 |  1 |  1 |  .012 | 27.4 |
Fig. 12.3.—A smoothing process applied in deriving T-scale equivalents for English-examination scores (see Table 12.2). [T scores, from 10 to 80 on the ordinate, are plotted against the raw-score scale, from 120 to 240 on the abscissa.]


TABLE 12.3.—A TABLE TO AID IN THE CALCULATION OF T SCORES

| Proportion below the point | T score | Proportion below the point | T score | Proportion below the point | T score |
| .0005 | 17.1 | .100 | 37.2 | .900  | 62.8 |
| .0007 | 18.1 | .120 | 38.3 | .910  | 63.4 |
| .0010 | 19.1 | .140 | 39.2 | .920  | 64.1 |
| .0015 | 20.3 | .160 | 40.1 | .930  | 64.8 |
| .0020 | 21.2 | .180 | 40.8 | .940  | 65.5 |
| .0025 | 21.9 | .200 | 41.6 | .950  | 66.4 |
| .0030 | 22.5 | .220 | 42.3 | .960  | 67.5 |
| .0040 | 23.5 | .250 | 43.3 | .965  | 68.1 |
| .0050 | 24.2 | .300 | 44.8 | .970  | 68.8 |
| .0070 | 25.4 | .350 | 46.1 | .975  | 69.6 |
| .010  | 26.7 | .400 | 47.5 | .980  | 70.5 |
| .015  | 28.3 | .450 | 48.7 | .985  | 71.7 |
| .020  | 29.5 | .500 | 50.0 | .990  | 73.3 |
| .025  | 30.4 | .550 | 51.3 | .993  | 74.6 |
| .030  | 31.2 | .600 | 52.5 | .995  | 75.8 |
| .035  | 31.9 | .650 | 53.9 | .9960 | 76.5 |
| .040  | 32.5 | .700 | 55.2 | .9970 | 77.5 |
| .050  | 33.6 | .750 | 56.7 | .9975 | 78.1 |
| .060  | 34.5 | .780 | 57.7 | .9980 | 78.8 |
| .070  | 35.2 | .800 | 58.4 | .9985 | 79.7 |
| .080  | 35.9 | .820 | 59.2 | .9990 | 80.9 |
| .090  | 36.6 | .840 | 59.9 | .9993 | 81.9 |
|       |      | .860 | 60.8 | .9995 | 82.9 |

that will come as close to all the points as seems possible. Among 
those that do not touch the line, as many of them should be above 
it as below it. The line may be extended beyond the ends of the 
points at both ends. If the raw-score distribution is skewed, the 
trend in the points when plotted will show some curvature. It is 
best, then, to attempt to follow the curvature but with a smooth 
trend. If the curvature is not followed, the distribution of the 
population on the scaled scores will not be normalized. 


Step 8. For any integral raw-score point, we can now find the correspond-
ing T-score points. For example, in Fig. 12.3, a raw score of
220 corresponds to a T score of 70, and a raw score of 150 cor-
responds to a T score of 33. In this we favor integral T scores but
at times have to resort to half points when we cannot decide upon
the nearest unit.

Step 9. Prepare a table in which every integral raw score, or every second,
third, or fifth one, appears in one column and the corresponding T
scores in the other. Table 12.4 is such a tabulation. It will serve
for all future purposes of translation where the original tested 
group remains the standard. Many test users prefer to list every 
raw score and its T-score equivalent so as to avoid the need for 
interpolation. 


TABLE 12.4.—RECTIFIED SCALING WITH T SCORES FOR THE DISTRIBUTION OF ENGLISH-EXAMINATION SCORES

| Examination score | T score | Examination score | T score | Examination score | T score |
| 240 | 81   | 195 | 57   | 155 | 35.5 |
| 235 | 78   | 190 | 54   | 150 | 33   |
| 230 | 75.5 | 185 | 51.5 | 145 | 30   |
| 225 | 73   | 180 | 49   | 140 | 27.5 |
| 220 | 70   | 175 | 46   | 135 | 25   |
| 215 | 67.5 | 170 | 43.5 | 130 | 22   |
| 210 | 65   | 165 | 41   | 125 | 20.5 |
| 205 | 62   | 160 | 38   | 120 | 17   |
| 200 | 59.5 |     |      |     |      |
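Steps 3 to 6 of the T-scaling procedure can be sketched numerically as follows. This is an illustration under stated assumptions, not the procedure as printed: instead of looking up Table 12.3, it obtains the normal deviate directly from the inverse normal function of Python's standard library and takes T = 50 + 10z. The frequencies are those of Table 12.2, listed from the lowest interval upward; the function name `t_scores` is an assumption.

```python
from statistics import NormalDist

# Frequencies from Table 12.2, from the lowest class interval (130-134) to the highest (225-229).
frequencies = [1, 0, 1, 1, 2, 5, 6, 5, 5, 9, 11, 6, 6, 6, 7, 5, 5, 1, 0, 1]
# Exact upper limits of the same intervals: 134.5, 139.5, ..., 229.5.
upper_limits = [134.5 + 5 * i for i in range(len(frequencies))]


def t_scores(freqs):
    """T score at the exact upper limit of each interval (None where the proportion is 0 or 1)."""
    total = sum(freqs)
    unit_normal = NormalDist()  # mean 0, standard deviation 1
    results = []
    cumulative = 0
    for freq in freqs:
        cumulative += freq                # step 4: cumulative frequency
        proportion = cumulative / total   # step 5: cumulative proportion
        if 0.0 < proportion < 1.0:
            z = unit_normal.inv_cdf(proportion)
            results.append(50.0 + 10.0 * z)   # step 6: T-scale equivalent
        else:
            results.append(None)          # no finite normal deviate for a proportion of 0 or 1
    return results


t_by_interval = t_scores(frequencies)
for limit, t in zip(upper_limits, t_by_interval):
    print(limit, None if t is None else round(t, 1))
```

The printed values should agree with column (6) of Table 12.2 to within about a tenth of a point; small differences arise because the table rounds the proportions to three places before the lookup. The topmost interval has a cumulative proportion of 1.000 and so receives no T score, as in the table.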


A Normal Graphic Procedure for T Scaling.—It is possible to do more 
of the T scaling graphically by the use of normal probability paper. This 
graph paper is especially designed with spacing for cumulative proportions 
along one axis in a manner consistent with the cumulative normal-curve 
function. Figure 12.4 shows how the English examination data can be so 
treated. Using the cumulative proportions appearing in Table 12.2, 
column (5), we plot each one against its corresponding raw-score value 
given in column (2). The trend of the points will be in a straight line if the
distribution of raw scores is normal. If that distribution is skewed there 
will be some curvature in the trend which one should try to follow in 
smoothing. To find the T equivalent for any raw score, we find that raw 
score on the base line, follow it up to the line drawn through the points, 
locate the equivalent proportion, then go to Table 12.3 for the correspond-
ing T.
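The straight-edge smoothing of step 7, like the line drawn on probability paper, can be approximated numerically. The sketch below substitutes an ordinary least-squares line for the line fitted by eye; that substitution is an assumption of this illustration, not the method described in the text, and it suits only distributions whose points show no marked curvature. The frequencies are those of Table 12.2, and the name `smoothed_t` is hypothetical.

```python
from statistics import NormalDist, fmean

# Frequencies from Table 12.2, lowest interval (130-134) first, with exact upper limits.
frequencies = [1, 0, 1, 1, 2, 5, 6, 5, 5, 9, 11, 6, 6, 6, 7, 5, 5, 1, 0, 1]
upper_limits = [134.5 + 5 * i for i in range(len(frequencies))]

total = sum(frequencies)
unit_normal = NormalDist()

# (upper limit, T score) points; the top interval (proportion 1.0) has no finite T score.
points = []
cumulative = 0
for limit, freq in zip(upper_limits, frequencies):
    cumulative += freq
    if cumulative < total:
        points.append((limit, 50.0 + 10.0 * unit_normal.inv_cdf(cumulative / total)))

# Least-squares line through the points, standing in for the line drawn by eye.
mean_x = fmean(x for x, _ in points)
mean_t = fmean(t for _, t in points)
slope = (sum((x - mean_x) * (t - mean_t) for x, t in points)
         / sum((x - mean_x) ** 2 for x, _ in points))
intercept = mean_t - slope * mean_x


def smoothed_t(raw_score):
    """Smoothed T-scale equivalent of a raw score, read from the fitted line."""
    return slope * raw_score + intercept


for raw in (150, 200, 220):
    print(raw, round(smoothed_t(raw), 1))
```

The fitted values for raw scores of 220 and 150 should fall close to the 70 and 33 read from Fig. 12.3. For a skewed distribution a straight line would fail to normalize the scores, and the curvature would have to be followed instead, as the text advises.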

An Evaluation of the T-scale Procedure.—The T scale is probably the
most widely used of all derived scales. Its advantages are many, its
disadvantages few. When the scaling is carried out as described, the
procedure normalizes distributions. This effect is pictured in Fig. 12.1.
Contrast distributions D and Z in that illustration. Both have a mean of
50 and a σ of 10. The one is skewed like the original distribution, the
other is normal. The normalizing process comes about through the con-
version to centiles and then to corresponding deviations from the mean in a
normal distribution. Table 12.3 is based upon the normal curve. For a
given proportion (area below a given point) is given a T-score equivalent
instead of a standard-score equivalent.


Fig. 12.4.—A graphic solution to scaling which utilizes normal probability graph paper. [Cumulative proportions, from 0.001 to 0.999 on the probability scale, are plotted against English examination scores from 120 to 240.]


The normalizing process may be pictured as in Fig. 12.5. There the 
obtained distribution, seriously skewed, is given below, and the normalized 
distribution on the derived scale above. The process assures that the 
areas A, B, C, . . . , M correspond in the proportions that they occupy
with areas A', B', C', . . . , M'. The correspondences of scale distances
are also shown, by connecting dotted lines. If the units on the derived scale
(not shown) represent genuinely equal increments of the measured varia-
ble, then obviously those on the original scale do not. We may not know
that the population is normally distributed on a trait, but by normalizing
distributions, where there is no inhibiting information to the contrary, we
achieve more common and meaningful scores.


Other advantages of the T scale have been mentioned: the possibility of
extending it beyond limited populations, its convenient mean, unit, and
standard deviation, and its general applicability. It has some limitations
which should be pointed out. In much practical use of tests, as fine a unit
as 0.1σ may be an overrefinement. Much coarser discriminations are all
that may be necessary. Furthermore, the unit may give quite a false
sense of accuracy of the measurement that is actually being made. If the
original scores had a standard deviation much smaller than 10 (for
example, one of 5 score units), then the substitution of a unit of 0.1σ is in a
sense “hairsplitting.” Two whole units on the T scale are then as fine a
distinction as we could actually make between individuals. Nor is this the
whole story. Every test, even the best of them, has an error of measure-
ment whose size is indicated by its “standard error of an obtained score”
(see Ch. 17). This stems from the fact that the test is not perfectly
reliable. If the error of measurement is as much as 2 units on the raw-
score scale, it might be even larger on the T scale. If the error is such that
the best practical discriminations we can make between individuals is of
the order of one-half σ, it is rather presumptuous to apply a scale that pre-
tends to distinguish to one-tenth σ. For this reason, particularly, and
because many test users require less refinement than the T scale offers, the
writer has proposed the C scale, which will be described next.


THE C SCALE AND C SCALING


The C-scale System.—The principles of the C scale and the derivation 
of C-scale equivalents for raw scores are illustrated in Table 12.5. The C


TABLE 12.5.—THE ELEVEN-POINT SCALED-SCORE SYSTEM AND ITS APPLICATION TO THE MEMORY-TEST DATA

| (1) C-scale score | (2) Standard-score limits | (3) Centile-rank limits | (4) Percentage within each interval | (5) Percentage in whole numbers | (6) Corresponding score points in the memory test | (7) Memory-test scores in each scaled-score interval |
|    | …     | …    |      |    | …    |       |
| 10 |       |      |  0.9 |  1 |      | 41+   |
|    | +2.25 | 98.8 |      |    | …    |       |
|  9 |       |      |  2.8 |  3 |      | 38-40 |
|    | +1.75 | 96.0 |      |    | 37.6 |       |
|  8 |       |      |  6.6 |  7 |      | 35-37 |
|    | +1.25 | …    |      |    | …    |       |
|  7 |       |      | 12.1 | 12 |      | 31-34 |
|    | +0.75 | …    |      |    | …    |       |
|  6 |       |      | 17.4 | 17 |      | 28-30 |
|    | +0.25 | 59.9 |      |    | 27.5 |       |
|  5 |       |      | 19.8 | 20 |      | 25-27 |
|    | −0.25 | 40.1 |      |    | …    |       |
|  4 |       |      | 17.4 | 17 |      | 21-24 |
|    | −0.75 | …    |      |    | …    |       |
|  3 |       |      | 12.1 | 12 |      | 18-20 |
|    | −1.25 | …    |      |    | …    |       |
|  2 |       |      |  6.6 |  7 |      | 15-17 |
|    | −1.75 | …    |      |    | …    |       |
|  1 |       |      |  2.8 |  3 |      | 12-14 |
|    | −2.25 | …    |      |    | …    |       |
|  0 |       |      |  0.9 |  1 |      | 0-11  |
|    | …     | …    |      |    | …    |       |


scale is so arranged that the mean will be exactly at 5.0, with the two limit- 
ing classes being 0 and 10. Column (2) gives the exact limits of the 11 
units in terms of standard scores. The corresponding centile limits 
(derived from Table B) are given in column (3). The percentage of cases
within each unit is found by subtracting neighboring pairs of centile 
limits. Thus, in the middle unit, the difference 59.9 — 40.1 = 19.8, etc. 
Since it is more convenient to think in terms of whole numbers, the approxi- 
mate percentages of the cases falling in the different classes are given as 
nearest whole numbers in column (5). These can be used either as a guide 
in thinking of the make-up of the standard distribution or even in sub-
dividing lists of scores of individuals when arranged in rank order. Thus,
if we had 100 persons lined up in rank order in a test, the highest person 
would be given the score of 10, the next 3 a score of 9, the next 7 a score of 
8, etc., until the last in line is given a score of 0. 
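The percentages of the C-scale system can be recomputed from the normal curve. The sketch below is an illustration under one stated assumption: the outermost limits are taken as −2.75 and +2.75 standard-score units, so that the two end classes have the same half-unit width as the rest. That assumption reproduces the 0.9 per cent shown for the end classes. The centile limits are rounded to one decimal before subtraction, as the text does in obtaining 59.9 − 40.1 = 19.8.

```python
from statistics import NormalDist

unit_normal = NormalDist()

# Standard-score limits of the 11 C-scale units: -2.75, -2.25, ..., +2.75 (outer limits assumed).
limits = [-2.75 + 0.5 * k for k in range(12)]

# Centile-rank limits, rounded to one decimal place like column (3) of Table 12.5.
centile_limits = [round(100 * unit_normal.cdf(z), 1) for z in limits]

# Percentage within each unit: difference between neighboring centile limits (column 4).
percentages = [round(upper - lower, 1)
               for lower, upper in zip(centile_limits, centile_limits[1:])]

# Nearest whole-number percentages (column 5); index 0 is C = 0, index 10 is C = 10.
whole_percentages = [round(p) for p in percentages]

for c_score, (pct, whole) in enumerate(zip(percentages, whole_percentages)):
    print(c_score, pct, whole)
```

The whole-number percentages add to 100, which is what allows 100 persons in rank order to be divided 1, 3, 7, 12, 17, 20, 17, 12, 7, 3, 1 as described above.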

Steps in Deriving a C Scale.—The operations for deriving a C scale are 
much the same as those for deriving a T scale. There are some differ- 
ences in the steps to be recommended, however, so all the steps will be 
listed here. 


Step 1. List the class intervals. 

Step 2. List the exact upper limits of the intervals. 

Step 3. List the frequencies. 

Step 4. List the cumulative frequencies. 

Step 5. Find the cumulative proportions for the intervals. 

Step 6. From here on the steps differ from those for T scaling. Next, plot 
the cumulative proportions on the ordinate corresponding to x 
values (exact upper limits) on the abscissa of coordinate paper.
(See Ch. 6 for further instructions.) 

Step 7. Draw by inspection a smooth S-shaped curve through the trend 
of the points. If the distribution is obviously skewed and one tail 
of the S is short, or even if it vanishes, follow the points anyway.
At this stage one sees the advantage of having a liberal number of 
classes. 

Step 8. Look for each of the centile limits [from column (3) of Table 12.5] 
on the ordinate, find the intersection of that centile-rank level 
with the curve, drop down to the abscissa to locate the correspond- 
ing raw-score point. Try to avoid arriving at a point exactly 
at integers, so that it is clear whether each integral raw score goes 
above or below the division point. The values obtained from this 
step are like those in column (6) of Table 12.5. 

Step 9. Determine within which C intervals the various integral score
values lie and write the limiting scores as in column (7) of Table
12.5.


Alternative Graphic C-scaling Steps.—If one already has a figure drawn
like Fig. 12.3 that is used in T scaling, one could use it to accomplish steps 6
and 7 in the following manner. The σ for the T scale is 10 and that for the
C scale is 2. The means are 50 and 5, respectively. An interval of one
unit on the C scale corresponds to five units on the T scale. A C score of 5,
therefore, occupies a range from 47.5 to 52.5; a C score of 6 corresponds to a
range 52.5 to 57.5, and so on. All the T-score limits of the C intervals can
be seen represented in Table 12.6. The T-score limits, therefore, can be


TABLE 12.6.—T SCORES EQUIVALENT TO C-SCORE INTERVALS

| C score | T-score limits | Middle T score |
| 10 | 72.5-77.5 | 75 |
|  9 | 67.5-72.5 | 70 |
|  8 | 62.5-67.5 | 65 |
|  7 | 57.5-62.5 | 60 |
|  6 | 52.5-57.5 | 55 |
|  5 | 47.5-52.5 | 50 |
|  4 | 42.5-47.5 | 45 |
|  3 | 37.5-42.5 | 40 |
|  2 | 32.5-37.5 | 35 |
|  1 | 27.5-32.5 | 30 |
|  0 | 22.5-27.5 | 25 |


located in Fig. 12.3 and from them the corresponding points of division on 
the raw-score scale. These mark off the raw-score ranges corresponding 
to all C scores. 
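The correspondence in Table 12.6 amounts to cutting the T scale into 5-point intervals beginning at 22.5. A minimal sketch follows; the function name `c_from_t` is an assumption, as are two conventions the table does not settle: a T score falling exactly on a limit is placed in the higher interval, and T scores beyond the tabled range are assigned to the end classes 0 and 10.

```python
import math


def c_from_t(t_score):
    """C score whose T-score interval (Table 12.6) contains the given T score."""
    # Intervals are 5 T units wide and start at 22.5, so 22.5-27.5 -> 0, 27.5-32.5 -> 1, ...
    c = math.floor((t_score - 22.5) / 5.0)
    # Assumption: T scores outside 22.5-77.5 are pooled into the end classes.
    return max(0, min(10, c))


# Middle T scores of Table 12.6 and the C scores they should yield.
for middle_t in (25, 30, 35, 40, 45, 50, 55, 60, 65, 70, 75):
    print(middle_t, c_from_t(middle_t))
```

Since the means are 50 and 5 and the standard deviations 10 and 2, the same result is obtained by rounding 5 + (T − 50)/5 to the nearest whole number and holding it within 0 to 10.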

The normal-graphic procedure described in connection with T scaling 
can also be applied here; in fact, it is even more convenient in this connec-
tion and is to be recommended in preference to steps 6 and 7. Since the
centile ranks are marked on probability paper (see Fig. 12.4), one would 
locate the centile-rank limits [column (3) of Table 12.5] and from the plot, 
usually a straight line, find the corresponding raw-score division points. 

An Evaluation of the C Scale.—The C scale has many of the advantages 
of the T scale. It refers obtained scores to a common scale that is related
to the normal distribution. If the population distribution on a measured 
trait is normal, then the distribution of C scores properly represents that 
population and the units of measurement may be regarded as equal. It 
lacks the refinement of a small unit such as that provided by the T scale. 
On the other hand, it probably more nearly represents the accuracy of 
discrimination actually made by means of tests, and its broader categories 
will do for guidance purposes. There is a handicap in selection of person- 
nel in that a change of minimum qualifying score of only one C-scale unit 
may result in quite a difference in percentage of cases selected. For exam- 
ple, if the cutoff score were changed from 5 to 6, 20 per cent more rejec- 
tions would have to be made. For selection purposes, however, raw-score 
cutoffs may be just as feasible as derived scores. The reference of any 
chosen raw-score cutoff to equivalent C-score limits or centiles would add
meaning to that particular value.


For guidance and counseling purposes, the use of a zero C score may be 
unwise. Unless he is more sophisticated than most people, a counselee 
would hardly relish being told that he earned a score of zero. To meet this 
contingency, one could let the scores range from 1 through 11 instead of 0 
through 10. Or one could resort to a condensed scale to be described next. 

The Stanine Scale.—There are several reasons for condensing the C
scale to some extent by giving it a 9-unit range. This is usually done by 
combining the two categories at either end, with 4 per cent of the dis- 
tribution in categories 1 and 9. Such a scale was standard for the Army 
Air Forces Aviation Psychology Program during World War II. All test 
scores and composites were eventually scaled to this system, called “sta- 
nine” as a contraction of “standard nine.” The mean of such a norm 
distribution would be 5.0, as in the C scale, but the standard deviation 
would be slightly lower—1.96—because of the contractions at the tails of 
the curve. 
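The effect of condensing the C scale can be verified from the whole-number percentages of Table 12.5. The sketch below pools C scores of 0 and 1 into stanine 1 and C scores of 9 and 10 into stanine 9, then computes the mean and standard deviation of each norm distribution. The variable and function names are assumptions made for this illustration.

```python
from math import sqrt

# Whole-number percentages for C scores 0 through 10, from column (5) of Table 12.5.
c_percent = {0: 1, 1: 3, 2: 7, 3: 12, 4: 17, 5: 20, 6: 17, 7: 12, 8: 7, 9: 3, 10: 1}

# Condense to nine units: C scores 0 and 1 become 1, C scores 9 and 10 become 9.
stanine_percent = {}
for c_score, pct in c_percent.items():
    stanine = min(9, max(1, c_score))
    stanine_percent[stanine] = stanine_percent.get(stanine, 0) + pct


def mean_and_sd(percent_by_score):
    """Mean and standard deviation of a distribution given as score -> percentage."""
    total = sum(percent_by_score.values())
    mean = sum(score * pct for score, pct in percent_by_score.items()) / total
    variance = sum(pct * (score - mean) ** 2 for score, pct in percent_by_score.items()) / total
    return mean, sqrt(variance)


c_mean, c_sd = mean_and_sd(c_percent)
stanine_mean, stanine_sd = mean_and_sd(stanine_percent)
print(stanine_percent)
print(round(c_mean, 2), round(c_sd, 2))
print(round(stanine_mean, 2), round(stanine_sd, 2))
```

Both means come out at 5.0. The C-scale standard deviation is very nearly 2, and the stanine standard deviation is about 1.96, the figure given in the text, with 4 per cent of the cases in each of the end categories.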

Perhaps the chief practical benefit to be derived from 9 units rather than 
11 is that such scores occupy only one column on the IBM punched-card 
records. For research purposes, however, a significant grouping error 
(see Ch. 5) is thus introduced, calling for corrections of various sorts when 
precise statistics are wanted. In guidance work, many counselors would 
probably not like to have the rare one person in a hundred at either 
extreme submerged with the other three per cent next to him. There is
probably a full unit’s discrimination between the hundredth person and the
next three per cent just as there is between any other neighboring cate-
gories. This loss of discrimination in the stanine scale may not be tolerated
and is unnecessary in the use of profiles in guidance.


SOME NORM AND PROFILE SUGGESTIONS


Suggestions were made in Ch. 6 concerning the derivation of centile 
norms and the construction of profiles. Here we are ready for other, more 
comprehensive, suggestions. There will be shown both a norm table and a
profile chart, in each of which raw scores can be interpreted in terms of the 
C scale, T scale, or centile rank. 

Conversion Table for Deriving Scaled Scores and Centiles.—Table 12.7
provides an example of a norm table based upon five parts of the Guilford-
Zimmerman Aptitude Survey. The five parts represented include tests of
Verbal Comprehension, General Reasoning, Numerical Operations, Perceptual
Speed, and Spatial Orientation, respectively. By means of the table, a
raw score can be readily transformed into a corresponding scaled
score. A raw score of 45 in Part I, for example, means a C score of 6; by
interpolation, a centile rank of 68; and a T score of 55. Each test user


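TABLE 12.7.—EXAMPLE OF RECOMMENDED GENERAL-PURPOSE NORMS, BASED UPON FIVE PARTS OF THE GUILFORD-ZIMMERMAN APTITUDE SURVEY TESTS (FORM A)

| C score | Part I | Part II | Part III | Part IV | Part V | Centile limits | T score |
| 10 | … | 25+   | 110+   | 72+   | 52+   | 99+   | 75.0 |
|    |   |       |        |       |       |       | 72.5 |
|  9 | … | 23-24 | 95-109 | 66-69 | 46-51 | 96-98 | 70.0 |
|    |   |       |        |       |       |       | 67.5 |
|  8 | … | 21-22 | 86-94  | 60-65 | 39-45 | …     | 65.0 |
|    |   |       |        |       |       |       | 62.5 |
|  7 | … | 18-20 | 77-85  | 54-59 | 33-38 | …     | 60.0 |
|    |   |       |        |       |       |       | 57.5 |
|  6 | … | 15-17 | 68-76  | 49-53 | 28-32 | …     | 55.0 |
|    |   |       |        |       |       |       | 52.5 |
|  5 | … | 12-14 | 59-67  | 44-48 | 22-27 | …     | 50.0 |
|    |   |       |        |       |       |       | 47.5 |
|  4 | … | 9-11  | 50-58  | 39-43 | 16-21 | …     | 45.0 |
|    |   |       |        |       |       |       | 42.5 |
|  3 | … | 6-8   | 41-49  | 33-38 | 10-15 | …     | 40.0 |
|    |   |       |        |       |       |       | 37.5 |
|  2 | … | 4-5   | 32-40  | 27-32 | 5-9   | …     | 35.0 |
|    |   |       |        |       |       |       | 32.5 |
|  1 | … | 2-3   | 23-31  | 21-26 | 2-4   | …     | 30.0 |
|    |   |       |        |       |       |       | 27.5 |
|  0 | … | …     | < 23   | …     | …     | …     | …    |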

would probably choose his own mode of interpretation and make only one 
of these conversions for each score. The most ready interpretation possi- 
ble here is to the C scale. If one had many conversions to make to centile 
ranks, or to T scores, one would probably prefer a table requiring less 
interpolation. Such a table would need more intervals than 11, or differ- 
ent limits, for greater convenience. 

A Profile Chart with Three Interpretive Scales.—Fig. 12.6 shows a 
profile chart that can be used to provide not only a general contour repre-
senting several traits for an individual, but also reference to three common
scales. The basic intervals are, again, the C-score units. Corresponding 
to each C score are given for each test the two limiting raw scores for that 
interval. At the lower margin are provided the single T score and the 
centile rank corresponding to the midpoint of each interval. Finer
decisions on both these scales can be made, if desired, by interpolation. 

The illustration shows the record for a certain individual who earned raw
scores of 28, 88, 21, and 23 in the Memory, Vocabulary, Word Building, and
Sentence Construction tests, respectively. From the chart the C scores
are obviously 6, 7, 6, and 4, respectively. The nearest centile ranks are
70, 84, 70, and 30, respectively. If we want more exact centile-rank 
estimates, by interpolation we get something like 67, 82, 73, and 25. It is 
doubtful, however, whether the actual refinement of the tests justifies 
distinctions as fine as these. There is little virtue in quibbling over a dif- 
ference of 2 to 5 centile rank points (except at the extremes of the range) 
when the error of measurement may be as much as twice as large. In 
terms of T scores, the nearest values are 55, 60, 55, and 45, respectively. 
Again, finer estimates could be made but are probably not necessary or 
entirely justifiable. It may be sufficient, for guidance purposes, at least,
to note that the individual represented is on the under side of a C score of 6
in the Memory test, just above the middle of the C score of 7 for the Vocab-
ulary test, in the upper half of the C score of 6 for Word Building, and on
the lower side of the C score of 4 for Sentence Construction, if not to be
content with the simple results of 6, 7, 6, and 4.

Fig. 12.6.—Norms on four tests (Memory, Vocabulary, Word Building, Sentence Construction) in terms of C scores, T scores, and centile ranks. A profile has been drawn for one individual.


Exercises 


1. Determine the standard scores for the two students in Data 12A. Draw conclu-
sions as to how the two students compare with respect to raw scores and with respect to
standard scores. Which is probably the better student in terms of aptitude (assuming
that all five tests measure different aspects of aptitude about equally well)? Which is
the more variable student? Support your answers with evidence.

2. Derive a conversion equation for translating scores in the Syllogism test into a
scale which would give a mean of 50 and a standard deviation of 10. Using the equa-
tion, what would the scores for students A and B become on the new scale?

3. Determine the equivalent T scores for one or more of the distributions in Data
12B.

4. By a graphic process find by smoothing a modified set of equivalent T scores for
the same distribution or distributions.

5. Using the T-score equivalents found in Exercise 3 or Exercise 4, convert the raw
scores for students A to D in Data 12C into corresponding T scores.


6. Determine the score limits for one or more of the tests in Data 12B corresponding 
to each C-score category. Use a smoothing process applied either to the ogive or to the 
line drawn on probability paper. 

7. Prepare a profile chart based on C-score units (but including also equivalent T 
scores and centiles) for the three tests in Data 12B. 

8. Report equivalent C scores for the four students in Data 12C. 


Data 12A.—MEANS AND STANDARD DEVIATIONS IN FIVE PARTS OF AN ENGINEERING-
APTITUDE EXAMINATION (ROUNDED TO WHOLE NUMBERS) AND SCORES OF TWO
STUDENTS

                   | Syllogism | Paper folding | Form perception
Mean               |    28     |      33       |       26
Standard deviation |     8     |       5       |  (illegible)
Student A          |    30     |      17       |       35
Student B          |    15     |      32       |       41


Data 12B.—FREQUENCY DISTRIBUTIONS OF ENGINEERING FRESHMEN IN THREE
APTITUDE TESTS

Scores | Cube visualizing | Syllogism | Form perception
45-49  |        ..        |     4     |       ..
40-44  |        ..        |    13     |        2
35-39  |        ..        |    29     |       16
30-34  |         1        |    42     |       42
25-29  |         8        |    45     |       52
20-24  |        35        |    43     |       55
15-19  |        58        |    24     |       26
10-14  |        63        |     6     |       13
 5- 9  |        36        |    ..     |        1
 0- 4  |         6        |    ..     |       ..
Sums   |       207        |   206     |      207


Data 12C.—SCORES OF FOUR STUDENTS IN THE THREE TESTS REPRESENTED IN DATA
12B

Student | Cube | Syllogism | Form
A       |  25  |    22     |  34
B       |   5  |    45     |  12
C       |  33  |    42     |  37
D       |  11  |    20     |  16


CHAPTER 13 
SPECIAL CORRELATION METHODS AND PROBLEMS 


Pearson's product-moment coefficient is the standard index of the
amount of correlation between two things, and we employ it whenever
it is possible and convenient to do so. But there are data to which this
kind of correlation method cannot be applied, and there are instances
in which it can be applied but in which, for practical purposes, other
procedures are more expedient. The Pearson coefficient cannot or should
not be computed, for example, unless the two variables X and Y are
measured on continuous metric scales and unless the regressions are linear
(see Ch. 15). Many of our data are in terms of frequencies of cases
having attributes; they are on variables of a “qualitative” rather than a
quantitative sort. Less often, two continuously measured variables bear
to one another a relationship that is curved rather than in the form of
a straight line. In this chapter will be described some procedures that
take care of these irregular situations and of other situations where
short-cut methods are better used to compute a Pearson r or its equivalent.

Even when we can apply the product-moment correlation method, how- 
ever, there are many circumstances which may give rise to a somewhat 
different estimate of correlation than is typical or to one that does not 
apply to the population in which we are interested. Samples may be 
heterogeneous or they may be restricted in variability or they may be 
forced into a smaller number of categories than we need for good estimates 
of correlation, free from errors of grouping. These, and other common 
irregularities in the sampling situation or in the data, call for special cor- 
rective steps and for special interpretive action. It is impossible to 
anticipate all the peculiarities of data that the reader may encounter, but
the more common exceptions to ideal correlation conditions will be touched 
upon. 


SPEARMAN’S RANK-DIFFERENCE CORRELATION METHOD 


When samples are small, a common procedure applied to regular data in
the place of the product-moment method is the rank-difference method of
Spearman. It is conveniently applied as a quick substitute when the number
of pairs, or N, is less than 30. It is even more conveniently applied
when the data are already in terms of rank orders rather than in terms of
measurements.

The Computation of a Spearman Rho.—If we have data in terms of 
measurements or scores, it is first necessary to translate them into rank 
orders. The procedure will be demonstrated by means of the data in 
Table 13.1. There we have 15 pairs of scores for 15 individuals who 


TABLE 13.1.—A RANK-DIFFERENCE CORRELATION BETWEEN HUMOR SCORES IN
REACTIONS TO CARTOONS AND TO LIMERICKS

Cartoon score | Limerick score | $R_1$ | $R_2$ | $D$ | $D^2$
      47      |       75       | 11    |   8   | 3   |  9.00
      71      |       79       |  4    |   6   | 2   |  4.00
      52      |       85       |  9    |   5   | 4   | 16.00
      48      |       50       | 10    |  14   | 4   | 16.00
      35      |       49       | 14.5  |  15   | 0.5 |  0.25
      35      |       59       | 14.5  |  15   | 0.5 |  0.25
      41      |       75       | 12.5  |   8   | 4.5 | 20.25
      82      |       91       |  1    |   3   | 2   |  4.00
      72      |      102       |  3    |   1   | 2   |  4.00
      56      |       87       |  7    |   4   | 3   |  9.00
      59      |       70       |  6    |  10   | 4   | 16.00
      73      |       92       |  2    |   2   | 0   |  0.00
      60      |       54       |  5    |  13   | 8   | 64.00
      55      |       75       |  8    |   8   | 0   |  0.00
      41      |       68       | 12.5  |  11   | 1.5 |  2.25
                                          $\Sigma D^2$ = 165.00


responded to sets of cartoons and limericks by judging their humor 
values, each on a 5-point scale. The score in each case is the sum of 
the points each individual assigned to the set. We could correlate these 
scores in the usual manner, described in Ch. 8. The rank-difference
method will be found shorter. The following steps are necessary:


Step 1. Rank the individuals from the highest to the lowest in the first
variable (here it is “cartoon score”), and call these ranks $R_1$.
The highest score receives the rank of 1 (which is arbitrary; we
might have called it 15), the next highest 2, etc. The only difficulty
encountered is when we find ties. For example, in Table
13.1, two individuals have scores of 41. One of them comes at
rank 12 and the other at rank 13. We do not know which, if either,
is better, yet we must fill these two rank positions; so we take the
average of the tied ranks and call them both 12.5. We make certain
that the next ranking scorer is called 14, unless he also is
tied. We find that he is tied with another who has a score of 35.
We treat these two in a similar manner; so they become each
14.5. If the lowest person is not tied with others, the last rank
should be equal to N (in this case, 15). This serves as a check as
to accuracy of ranking, though, of course, it will not detect
inversions in rank order somewhere along the line. It merely
shows whether any rank has been repeated, whether any individuals
have been overlooked, or whether ties have somewhere not
been properly treated.

Step 2. Rank the second list of measurements in a similar manner, and
call them $R_2$. In this problem, there are three scores of 75 for the
individuals occupying places 7, 8, and 9. We call them all 8,
leaving out of the list 7 and 9. This treats the three alike, as they
should be, yet gives us a full set of 15 ranks.

Step 3. For every pair of ranks (for each individual), determine the 
difference in ranks. The smaller one can be subtracted from the 
larger one in each case, with no attention being paid to algebraic 
signs, for they are all going to be squared anyway. 

Step 4. Square each difference to find $D^2$.

Step 5. Sum the squares of the differences (see the last column of Table
13.1) to find $\Sigma D^2$. The sum in our illustrative problem is 165.00.

Step 6. Compute the coefficient $\rho$ (Greek letter rho) by means of the
formula

$$\rho = 1 - \frac{6\Sigma D^2}{N(N^2 - 1)} \qquad \text{(Rank-difference coefficient of correlation)} \qquad (13.1)$$

where $\Sigma D^2$ = sum of the squared differences between ranks.
N = number of pairs of measurements.

In this problem

$$\rho = 1 - \frac{6 \times 165}{15 \times 224} = 1 - .295 = .705, \text{ or } .70$$

By this procedure, then, the estimate of the amount of correlation between
the two sets of scores is .70. How shall we interpret this correlation,
compared with a Pearson r?
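
The ranking and formula steps above can be sketched in Python. This is only an illustrative sketch and not part of the original procedure: the function names `average_ranks` and `rho_from_sum_d2` are hypothetical, tied scores receive the mean of the rank positions they occupy (as in Step 1), and the value 165 is the sum of squared differences as printed in Table 13.1.

```python
def average_ranks(scores):
    """Rank from highest (rank 1) to lowest; tied scores share the mean of their ranks."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    ranks = [0.0] * len(scores)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the run while the next score equals the current one (a tie).
        while end + 1 < len(order) and scores[order[end + 1]] == scores[order[pos]]:
            end += 1
        mean_rank = ((pos + 1) + (end + 1)) / 2  # mean of the tied rank positions
        for j in range(pos, end + 1):
            ranks[order[j]] = mean_rank
        pos = end + 1
    return ranks


def rho_from_sum_d2(sum_d2, n):
    """Formula (13.1): rho = 1 - 6 * sum(D^2) / (N * (N^2 - 1))."""
    return 1 - 6 * sum_d2 / (n * (n ** 2 - 1))


# Cartoon scores from Table 13.1.
cartoon = [47, 71, 52, 48, 35, 35, 41, 82, 72, 56, 59, 73, 60, 55, 41]
r1 = average_ranks(cartoon)

# Sum of squared rank differences as printed in Table 13.1, with N = 15.
rho = rho_from_sum_d2(165.0, 15)
print(r1)
print(round(rho, 3))
```

Ranking the cartoon scores in this way reproduces the first rank column of Table 13.1, and the formula gives a rho of about .705.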




Interpretation of a Rho Coefficient.—The rank-difference coefficient is
practically equivalent to the Pearson r numerically. There is a conversion
formula by which the corresponding Pearson r can be estimated
from rho. But this formula assumes large samples, which is precisely
what we do not have when we compute rho. In no case is the difference
between rho and r greater than .018, and in every case, except for
coefficients of zero or 1.00, r is greater than rho. We may therefore treat an
obtained rho as an approximation to r and under these circumstances
interpret the outcome of a correlation study accordingly.

The estimation of the reliability of rho, as indicated by its standard
error, is in some doubt, but a rough approximation is given by the formula¹

$$\sigma_\rho = \frac{1.04(1 - \rho^2)}{\sqrt{N - 1}} \qquad (13.2)$$

For the illustrative problem,

$$\sigma_\rho = \frac{1.04(1 - .497025)}{\sqrt{14}} = .14$$

Formula (13.2) indicates that the sampling error for $\rho$ is only about 4 per
cent greater than for a corresponding Pearson r. The obtained $\sigma_\rho$ of .14
indicates that the obtained $\rho$ of .70 is not a very accurate estimation of the
correlation in the population. The standard error of .14 would allow
considerable dispersion in sample $\rho$'s for samples of this size. Can we feel
confident that the obtained coefficient indicates any positive correlation at
all? Reference to Table D in which we look for 13 degrees of freedom
shows that it requires a Pearson r of .514 to be significant at the 5 per cent
level and an r of .641 to be significant at the 1 per cent level. The similar
confidence levels for $\rho$ would have to be about 4 per cent higher, or .54 and
.67, respectively. There is less than one chance in 100, apparently, that
such a correlation as .70 could have happened if there really were no
correlation between the two variables, cartoon scores and limerick scores.
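
As a rough check on this arithmetic, formula (13.2) and the adjusted significance levels can be computed directly. The sketch below assumes the constant 1.04 as it appears in formula (13.2) and takes the two Pearson r values for 13 degrees of freedom from the text; the function name `rho_standard_error` is hypothetical.

```python
import math


def rho_standard_error(rho, n):
    """Rough standard error of rho, formula (13.2): 1.04 * (1 - rho^2) / sqrt(N - 1)."""
    return 1.04 * (1 - rho ** 2) / math.sqrt(n - 1)


# Illustrative problem: rho = .705 from N = 15 pairs.
se = rho_standard_error(0.705, 15)

# Pearson r values required at the 5 and 1 per cent levels (13 df), as quoted in the text.
r_05, r_01 = 0.514, 0.641

# The text raises these by about 4 per cent to obtain comparable levels for rho.
rho_05 = 1.04 * r_05
rho_01 = 1.04 * r_01

print(round(se, 2))
print(rho_05, rho_01)
print(0.70 > rho_01)  # does the obtained rho exceed the 1 per cent requirement?
```

The adjusted requirements come out close to the .54 and .67 quoted above, and the obtained rho of .70 exceeds both.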

A Method of Dealing with Ties.—DuBois has shown that the procedure
of giving tied scores a common rank equal to the mean of the ranks involved
in the ties is a good approximation, but that when more than two or three
scores are tied, a better estimate is desirable. His formula is

$$R_c = \sqrt{M_R^2 + \frac{n^2 - 1}{12}} \qquad \text{(Tied rank corrected)} \qquad (13.3)$$

¹ Thornton, G. R. The significance of rank difference coefficients of correlation.
Psychom., 1943, 8, 211-222; Olds, E. G. Distribution of sums of squares of rank
differences for small numbers of individuals. Ann. math. Stat., 1938, 9, 133-148.


where $M_R$ = mean of the ranks for the ties.
n = number tied.

Every case of ties would have to be treated separately. For example, in
our second set of scores in Table 13.1, there are three ties for eighth place
(or three identical scores for places 7, 8, and 9). Applying DuBois’s
formula,¹

$$R_c = \sqrt{8^2 + \frac{3^2 - 1}{12}} = \sqrt{64 + .67} = \sqrt{64.67} = 8.04$$


Had we used 8.04 instead of 8 in the computation of rho, the result would 
hardly have been affected. With five or more ranked scores tied, however, 
the correction would probably make an appreciable difference.
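
DuBois's correction, as it is applied in the worked example (the square root of 64 + .67), can be sketched as follows. The function name `corrected_tied_rank` is hypothetical, and the formula is written as the square root of the squared mean rank plus (n² − 1)/12, which is the form that reproduces the 8.04 of the example.

```python
import math


def corrected_tied_rank(mean_rank, n_tied):
    """Corrected common rank for n_tied scores whose ordinary tied rank is mean_rank."""
    return math.sqrt(mean_rank ** 2 + (n_tied ** 2 - 1) / 12)


# Three limerick scores of 75 tied at places 7, 8, and 9 (mean rank 8).
rc_three = corrected_tied_rank(8, 3)

# For comparison, a hypothetical tie of five scores centered on the same rank.
rc_five = corrected_tied_rank(8, 5)

print(round(rc_three, 2))
print(rc_five > rc_three)  # the correction grows with the number of tied scores
```

The comparison with a five-way tie is only a hypothetical case, included to show that the correction becomes larger as more scores are tied.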

A Brief Evaluation of the Rank-difference Correlation.—Because its
standard error is slightly larger than that of the Pearson r, and because its
use is limited to small samples, it is doubtful whether Spearman’s $\rho$ should
be used for any purpose except to examine the possibility of correlation
between two variables. It should not be used to estimate the size of the
correlation coefficient unless it happens to be very large (not less than .70),
and then only as a rough estimate. Its use for research purposes is therefore
very limited. It is a product-moment coefficient and, as such, it approximates
the corresponding Pearson r rather closely. The difference between the
two is ordinarily well below the size of the standard error of either. There
appear to be no specific requirements laid down by writers who describe this
coefficient, but since it is a product-moment r and is an estimate of the
Pearson r, its use for that purpose presumably rests upon the same
assumptions as are necessary in that connection: linear regression and
homoscedasticity, in particular.


THE CORRELATION RATIO


The correlation ratio is a very general index of correlation particularly
adapted to data in which a curved regression prevails. Among
test scores, linear relationships are apparently the almost universal
type of regression. Normality, or near normality, in both distributions
correlated is almost sufficient in itself to promote linearity. Outside
the sphere of psychological and educational tests, however, or when
outside variables are correlated with test scores, we sometimes encounter
curved trends in the scatter diagram. The means of the columns do
not progressively increase as we go up the X scale. They may increase
slowly at first then rapidly later, or they may increase to a maximum
in the center and then decrease, or other systematic divergencies from
linearity may be apparent.

¹ DuBois, P. H. Formulas and tables for rank correlations. Psychol. Rec., 1939, 3,
46-56.

Nonlinear Regressions.—A common instance of nonlinear relationship 
is found when we correlate performance scores with chronological age. 
Typically, goodness of performance, as measured, increases most rapidly 
from ages five to ten and thereafter shows a slackening in upward trend 
through the teens. If we follow the progression still further, we find typi- 
cally a maximal performance somewhere in the twenties, with slow decline 
to the forties and an increasing rate of decline thereafter. If we included 
all ages from five to seventy-five in our correlation study and if we
computed the usual Pearson r between age and scores, the r would probably
prove to be near zero. On such a correlation diagram, the scattering of 
points would be considerably dispersed from any straight line that we 
might try to draw through the data, slanting upward or slanting downward. 
Inspection would show, nevertheless, a law of relationship between age 
and performance but a relationship that takes into account the waxing 
and waning of ability both within the span of ages studied. 
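
The point can be illustrated with a small, entirely hypothetical set of scores that rise and then fall symmetrically across five age groups. The data and the function names `pearson_r` and `correlation_ratio` below are assumptions made for illustration only; the correlation ratio is computed here as the square root of the between-column sum of squares over the total sum of squares.

```python
import math
from statistics import mean


def pearson_r(xs, ys):
    """Product-moment correlation between paired lists xs and ys."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def correlation_ratio(xs, ys):
    """Eta for the regression of y on x, treating each distinct x as one column."""
    columns = {}
    for x, y in zip(xs, ys):
        columns.setdefault(x, []).append(y)
    grand = mean(ys)
    between = sum(len(c) * (mean(c) - grand) ** 2 for c in columns.values())
    total = sum((y - grand) ** 2 for y in ys)
    return math.sqrt(between / total)


# Hypothetical age groups (coded 1 to 5), two scores per group, rising then falling.
ages = [1, 1, 2, 2, 3, 3, 4, 4, 5, 5]
scores = [10, 12, 18, 20, 24, 26, 18, 20, 10, 12]

r = pearson_r(ages, scores)
eta = correlation_ratio(ages, scores)
print(r, eta)
```

With these made-up figures the straight-line coefficient vanishes while the correlation ratio is very high, which is the situation described in the paragraph above.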

We might break the chart in two and treat by themselves the years 
during which there is improvement and by themselves the years during 
which there is decline. We should be able to compute a positive cor- 
relation for the earlier span and a negative correlation for the later 
span by assuming straight-line trends. But these would be of doubtful 
significance and certainly would not do justice to the full strength of 
relationships, even within the two segments of life span. The reason 
is that the trends still deviate from straight lines. Curvature has been 
overlooked, and to that extent the index of correlation is perhaps markedly 
underestimated. 

Two Regression Lines and Two Correlation Ratios—The scatter dia- 
gram in Fig. 13.1 represents a sample of relationship between perform- 
ance score in a form-board test and chronological age between five and 
fifteen years inclusive. Here the score is time required for completion, 
So a high number indicates a poor performance, and the trend is down- 
ward. But the relationship obviously drops most rapidly during the 
first 3 years and settles down to slight changes from year to year during 
the last 3 years. Two regression lines are drawn in the diagram to show
more clearly the trends. The regression of test score on age is shown by
the solid line that is drawn connecting the circlets, which are plotted at the
means of the columns. The regression of age upon test score is shown by
the dotted line and the means of the rows, by the x’s. Just as we find two
regression lines (for an imperfect correlation) in Ch. 15, where linear 
regressions are involved, so here we find two regression curves, differing in 
shape as well as in slope. We have accordingly two correlation ratios, or 
eta coefficients, one for each of the regressions, and they will not
necessarily be the same in value. This result differs from that in the case of
linear correlation, where $r_{yx} = r_{xy}$. The two correlation ratios are given
by the formulas

$$\eta_{yx} = \frac{\sigma_{y'}}{\sigma_y} \qquad \text{(Correlation ratio for the regression of Y on X)} \qquad (13.4a)$$

and

$$\eta_{xy} = \frac{\sigma_{x'}}{\sigma_x} \qquad \text{(Same, for regression of X on Y)} \qquad (13.4b)$$

where $\sigma_{y'}$ = standard deviation of the values (Y') predicted from X, and
$\sigma_{x'}$ = standard deviation of the X values predicted from Y.
$\sigma_y$ and $\sigma_x$ = standard deviations of the two total distributions.
The manner in which $\sigma_{y'}$ and $\sigma_{x'}$ are determined will next be explained.

Fig. 13.1.—A scatter diagram for a correlation-ratio problem. (X:
chronological age in years.)

The Computation of a Correlation Ratio.—In a prediction problem of
this sort, the best prediction of Y for any column is the mean of the Y’s in
that column. This prediction will have the smallest sum of squared
deviations from the observed Y’s in that column. So Y' for each column
is the mean of that column. We therefore first compute the means of the
columns. These are listed in column (3) in Table 13.2. Now if there were
no correlation, no law of relationship between Y and X, these Y' values
would lie along the level of the mean of all the Y values, which in this
problem is 23.0. No predictions could then be made on the basis of
knowledge of X values. For every column with its X value (midpoint),
the most probable corresponding Y would be 23.0 and our margin of error
would be indicated by $\sigma_y$. It would be as large as if we had no knowledge
of X for each individual (see Ch. 15 for a more complete discussion of this
point).

The more the means of the columns deviate from the mean of all
the Y’s the more accurate our predictions are. We are therefore interested
in how far the Y' values do deviate from 23.0 in this problem.
Those discrepancies $(Y' - M_y)$ are given in column (4) of Table 13.2.
As usual, we square the discrepancies or deviations and find their mean
as an indicator of how great is their average. The squared discrepancies
$(Y' - M_y)^2$ are given in column (5) of Table 13.2. But before


TABLE 13.2.—THE COMPUTATION OF A CORRELATION RATIO FOR THE REGRESSION OF
TEST SCORE ON CHRONOLOGICAL AGE

 (1)  | (2) |  (3)  |     (4)     |      (5)      |       (6)
  X   |  n  |  Y'   | $Y' - M_y$  | $(Y' - M_y)^2$ | $n(Y' - M_y)^2$
  14  |  10 | 11.0  |   -12.0     |    144.00     |   1,440.00
  13  |  15 | 14.0  |   - 9.0     |     81.00     |   1,215.00
  12  |  12 | 14.5  |   - 8.5     |     72.25     |     867.00
  11  |  19 | 16.0  |   - 7.0     |     49.00     |     931.00
  10  |  18 | 18.1  |   - 4.9     |     24.01     |     432.18
   9  |  21 | 20.8  |   - 2.2     |      4.84     |     101.64
   8  |  18 | 25.1  |   + 2.1     |      4.41     |      79.38
   7  |  15 | 31.3  |   + 8.3     |     68.89     |   1,033.35
   6  |  13 | 40.5  |   +17.5     |    306.25     |   3,981.25
   5  |   9 | 49.8  |   +26.8     |    718.24     |   6,464.16
 Sums | 150 |       |             |               |  16,544.96   $\Sigma n(Y' - M_y)^2$
                                                      110.2997   $\sigma_{y'}^2$
                                                       10.50     $\sigma_{y'}$


finding a mean of the squared discrepancies, we weight each one for a
column by the number of cases in that column. The weighted, squared
discrepancy for each column will be found in the last column of Table
13.2. Then they are summed, and we divide by N, which is 150 in this
problem, to find $\sigma_{y'}^2$, which is 110.2997. The square root of this is
10.50, which is the $\sigma$ of the discrepancies.




Remember that these are not the discrepancies of the observed points
from the predicted Y values, for the larger these are the lower our
correlation. We are here interested in the size of discrepancies between
predicted Y values and the mean of all Y values, and the larger these
are the higher our correlation. When the correlation is perfect, $\sigma_{y'}$
is as large as $\sigma_y$, for then the ratio $\sigma_{y'}/\sigma_y$ equals 1.00. When $\sigma_{y'} = 0$,
the ratio equals zero. In this problem, $\sigma_y = 12.535$. The correlation
ratio is therefore

$$\eta_{yx} = \frac{10.50}{12.535} = .838$$


The steps in computing a correlation ratio may be summarized as
follows. Remember that for finding $\eta_{xy}$, we are dealing with rows rather
than columns, so the steps will be the same except for the substitution of
the word row for the word column in what follows and the substitution
of X for Y.

Step 1. Determine the mean of all the Y values and also their standard
deviation.

Step 2. Determine the means of the columns (Y').

Step 3. Determine the discrepancies between Y' and $M_y$.

Step 4. Square the discrepancies.

Step 5. Multiply each squared discrepancy by the number of the cases
in the column (n).

Step 6. Sum the weighted, squared discrepancies, and divide by N. This
gives $\sigma_{y'}^2$. From this, find $\sigma_{y'}$.

Step 7. Solve the ratio $\sigma_{y'}/\sigma_y$, which is $\eta_{yx}$.
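
The steps just summarized can be traced with the figures of Table 13.2. The following is a minimal sketch rather than a general routine: the variable names are hypothetical, the column frequencies and column means are copied from the table, and the mean (23.0) and standard deviation (12.535) of all the Y values are taken from the text instead of being recomputed from raw scores.

```python
import math

# Columns of Table 13.2, from age 14 down to age 5.
n = [10, 15, 12, 19, 18, 21, 18, 15, 13, 9]                        # cases per column
col_means = [11.0, 14.0, 14.5, 16.0, 18.1, 20.8, 25.1, 31.3, 40.5, 49.8]  # Y' values

M_y = 23.0        # mean of all Y values (Step 1, as given in the text)
sigma_y = 12.535  # standard deviation of all Y values (Step 1, as given in the text)

N = sum(n)

# Steps 3 to 5: discrepancy of each column mean from M_y, squared, weighted by n.
weighted_sq = [k * (m - M_y) ** 2 for k, m in zip(n, col_means)]

# Step 6: sum, divide by N, and take the square root.
var_pred = sum(weighted_sq) / N
sigma_pred = math.sqrt(var_pred)

# Step 7: the correlation ratio for the regression of Y on X.
eta_yx = sigma_pred / sigma_y

print(N, round(sum(weighted_sq), 2), round(var_pred, 4))
print(round(sigma_pred, 2), round(eta_yx, 3))
```

For the other correlation ratio the same lines would be applied to the rows, with the row means and the X distribution substituted throughout.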


The Standard Error of a Correlation Ratio.—The reliability of a
correlation ratio, like the reliability of r, is given by its standard error, and
this is derived by a similar formula

$$\sigma_\eta = \frac{1 - \eta^2}{\sqrt{N - 1}} \qquad \text{(Standard error of a correlation ratio)} \qquad (13.5)$$

The standard error of the eta coefficient that we have just obtained is .024.
The amount of correlation is therefore rather close to the population
correlation.

The Standard Error of Estimate in a Nonlinear Regression.—The
standard error of estimate here can be computed as from a Pearson r (see
formulas 15.16a and 15.16b), but it can also be obtained from the
knowledge that

$$\sigma_{yx}^2 + \sigma_{y'}^2 = \sigma_y^2$$



That is, the total variance in the Y distribution is made up of two
components, the variance predictable from X (this is $\sigma_{y'}^2$) and the variance
not predictable from X (which is $\sigma_{yx}^2$). Transposing, we have

$$\sigma_{yx}^2 = \sigma_y^2 - \sigma_{y'}^2$$

In solving for an eta coefficient, we must know both the terms on the
right of this equation. For our illustrative problem, they are 157.1262 and
110.2997, respectively. The difference is 46.8265, which is the
nonpredicted variance. The square root of this, which is 6.84, gives us
$\sigma_{yx}$. The standard error of estimate tells us how much dispersion there is of
the obtained values (Y values in this case) around the predicted values (Y’
values in this case). The figure 6.84 tells us that two-thirds of the time
scores in the Form Board test may be expected to be within 6.84 units of
the predicted values, when the predicted values are the means of the
columns of the scatter diagram.
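
The subtraction of variances described here is brief enough to check by machine. The sketch below uses the two variances quoted in the text; the variable names are hypothetical.

```python
import math

var_total = 157.1262  # variance of all Y values
var_pred = 110.2997   # variance of the predicted values (column means)

# The variance not predictable from X is what remains after subtraction.
var_resid = var_total - var_pred

# Its square root is the standard error of estimate.
se_estimate = math.sqrt(var_resid)

print(round(var_resid, 4), round(se_estimate, 2))
```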

The Relation of the Correlation Ratio to Analysis of Variance.—Those 
who have read Ch. 10 will find much that is familiar in the preceding 
paragraphs. Regarding the successive columns of data, which are really 
the result of a one-way classification on a quantitative variable, namely, 
chronological age, as sets, we have all the information we need, to proceed 
with an analysis-of-variance solution (see Table 13.3). The sum 16,544.96 


TABLE 13.3.—AN ANALYSIS OF VARIANCE BASED UPON STATISTICS DERIVED IN THE
SOLUTION OF A CORRELATION RATIO

Component    | Degrees of freedom | Sums of squares | Variances
Between sets |          9         |    16,544.96    | 1,838.33
Within sets  |        140         |     7,023.97    |    50.17
Total        |        149         |    23,568.93    |

$$F = \frac{1838.33}{50.17} = 36.6$$


will be recognized as the sum of squares between sets, since it is based upon 
the squared deviations of set means from the composite mean. The sum 
7,023.97 will be recognized as the sum of squares within sets. This sum is 
found most conveniently here from what we already know. It is given 
by the product $N\sigma_{yx}^2$, which in this problem is 150 × 46.8265 = 7023.97.
The sum of the two sums of squares makes up the total sum of squares for 
the composite sample in variable Y. All we need next are the degrees of 
freedom. For the between variance there are 9 (the number of sets minus
1). For the within variance there are 140 (N minus the number of sets).
The two estimates of the population variance are given in Table 13.3, also 
the F ratio, which is 36.6. Reference to Table F (Appendix B) shows that 
it is well above the F required for significance at the 1 per cent level of 
confidence, which is about 2.5. 
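
The entries of Table 13.3 follow from quantities already in hand, as this short sketch suggests. The variable names are hypothetical; the sum of squares between sets and the nonpredicted variance are taken from the text. The last lines use the algebraic identity that the same F can be written in terms of the squared correlation ratio, where the squared ratio is the between sum of squares divided by the total sum of squares.

```python
N, k = 150, 10            # cases and number of sets (columns)
ss_between = 16544.96     # sum of squares between sets, from Table 13.2
var_resid = 46.8265       # variance not predictable from X

ss_within = N * var_resid          # sum of squares within sets
ss_total = ss_between + ss_within

df_between = k - 1
df_within = N - k

ms_between = ss_between / df_between
ms_within = ss_within / df_within
F = ms_between / ms_within

# The same ratio expressed through eta squared (between SS over total SS).
eta_sq = ss_between / ss_total
F_from_eta = (eta_sq / df_between) / ((1 - eta_sq) / df_within)

print(df_between, df_within)
print(round(ms_between, 2), round(ms_within, 2), round(F, 1))
```

Writing F in terms of the squared correlation ratio makes plain why the two approaches lead to the same decision about the existence of a relationship.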

The relationship pointed out here is more of academic interest than of 
practical interest, for we already know that the eta coefficient was so high 
that there was little doubt of a law of relationship existing between 
chronological age and test score. Furthermore, the eta coefficient tells us 
a fact, namely, concerning the degree of relationship, which an F ratio does 
not convey. When the eta is near the lower margin of significance and a 
more rigorous test of significance is required, when a decision is to be 
made as to whether or not there is any genuine relationship at all, then the
F test has its advantages.
A Test of Linearity of Regression.—Often the curvature in regression is
so slight that we do not know but that it is merely a chance deviation from
linearity. We therefore want some statistical test to show whether or not
the curvature is probably real. Several tests of nonlinearity have been
proposed. Probably the most dependable one is that suggested by Fisher.
This method depends upon the already familiar chi square (see Ch. 11). 
For the solution of chi square here, we need to know the Pearson r for the 
same data for which an eta coefficient has been computed. The formula 
for chi square is 


$$\chi^2 = (N - k)\left(\frac{\eta^2 - r^2}{1 - \eta^2}\right) \qquad \text{(Chi-square test of linearity)} \qquad (13.6)$$

where k = number of columns (or rows). For the problem in recent
paragraphs, the Pearson r was found to be .763. By formula (13.6), we
have

$$\chi^2 = (150 - 10)\left(\frac{.702244 - .582169}{1 - .702244}\right) = (140)\left(\frac{.120075}{.297756}\right) = 56.5$$

With a chi square of this size and k − 2 degrees of freedom, the divergence
between $\eta_{yx}$ and $r_{yx}$ is so great as to leave little doubt about curvature.

The hypothesis tested here is that the regression of Y on X is linear.
In more exact terms, the hypothesis requires that the means of the columns
all fall exactly along a straight line whose slope is described by the Pearson
r. Now if the actual form of regression were linear, sampling errors would
cause the means of columns to deviate slightly from the best-fitting straight
line. The sampling distribution is of these deviations of the actual means
of the columns, the $M_y$ values, from the regression line. These deviations
are ordinarily sufficient to make the eta coefficient larger than the Pearson 
r computed from the scatter diagram. The question is whether the 
deviations are large enough to suggest that there is something over and 
above these chance deviations involved. That is what the chi-square test 
here is supposed to tell us. The chi-square test should be applied to 
this particular use only when N exceeds k considerably. 
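
The chi-square test of linearity reduces to one line of arithmetic once eta and r are known. The sketch below assumes formula (13.6) in the form (N − k)(η² − r²)/(1 − η²), which reproduces the worked figure of 56.5; the function name `chi_square_linearity` is hypothetical.

```python
def chi_square_linearity(eta, r, n, k):
    """Chi square for the hypothesis of linear regression, with k - 2 degrees of freedom."""
    return (n - k) * (eta ** 2 - r ** 2) / (1 - eta ** 2)


# Illustrative problem: eta = .838, r = .763, N = 150 cases in k = 10 columns.
chi2 = chi_square_linearity(0.838, 0.763, 150, 10)
df = 10 - 2

# 20.09 is the tabled chi square at the 1 per cent level for 8 degrees of freedom.
print(round(chi2, 1), df, chi2 > 20.09)
```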

An Evaluation of the Correlation Ratio.—The chief advantage and use 
of the eta coefficient has been indicated and illustrated—to determine the 
closeness of relationship between two variables when the regression is 
clearly nonlinear. Although very few nonlinear regressions have been 
found in the correlation of measures of ability, it is likely that there are 
many more such relationships in psychology and education than has been 
realized. This is true if we broaden our conception of the correlation
problem considerably by saying that an index of correlation (index is a 
more inclusive term than coefficient) is a measure of the goodness of fit of 
obtained data to a regression line, whether it be straight or curved. The 
Pearson r indicates the goodness of fit of observed points to a straight line. 
Other indices, including eta, show the goodness of fit of data to other 
functions. 

Correlation Coefficients as Indices of Goodness of Fit.—This broadening of 
the concept of correlation would bring into consideration curves of learn- 
ing and retention and many others. The eta coefficient assumes no par- 
ticular type of functional relationship between Y and X. The type of 
relationship is defined by the actual, unsmoothed trend of the means of the 
columns (or rows). In this fact is both strength and weakness. Allowing 
the curvature of the regression to be as complex as the ups and downs in 
obtained class means make it, we find in eta the maximum size of correla- 
tion index for any set of data. We might assume some kind of mathe- 
matical function for the data represented in Fig. 13.1—a hyperbola or 
parabola, a logarithmic function or some other. The goodness of fit, as 
indicated by a correlation index, would probably not be so high for any of 
these functions as the eta coefficient indicates. Because the eta coeffi- 
cient does allow the regression curve to follow the means of the columns, a 
certain amount of error or purely sampling variance undoubtedly gets into 
the deviations of column means from the general mean of the Y’s and 
hence the eta is a somewhat inflated figure. When the actual regression is 
linear, the difference between eta and r computed for the same data tells us 
about how much inflation has occurred. When the regression is nonlinear, 




we have less ready evidence as to how much inflation there is. We should 
therefore discount any eta a little, particularly if the means of sets do not 
follow a smooth trend rather well. The smaller the sample, the more 
irregular the trend of the set means is likely to be, and therefore the greater 
the proportion of inflation in eta. 

Examples of Nonlinear Regressions.—In addition to the functional
relationships involved in learning and other phenomena, it is likely that when
more is known about human traits that are not abilities (temperament,
interests, attitudes, and the like) and their interrelationships, we will
find many more examples of nonlinear regression. In the validation of
test scores against vocational or other criteria of adjustment, more and
more of such examples are coming to light. It has been known for some
time that high “intelligence” may be just as bad prognostically as low
“intelligence” in connection with proficiency in routine and repetitive job
assignments. This result will probably be found more general than has
been supposed. The reason it has not been more widely recognized before
is that relatively short ranges of ability have been related to proficiency
criteria. If the total range, from lowest to the very highest, is studied
in relation to proficiency indices on various kinds of jobs (except those
requiring highest abilities) we may find the optimal ability to be somewhat
short of the top in most cases. This definitely means nonlinear regressions.
A number of instances have been called to the writer’s attention in which
scores on temperament tests bore a relation to rated proficiency in such a
way that the optimal position on the trait score was barely above average.
The application of the Pearson r method sometimes shows a near zero
correlation in such instances whereas an eta coefficient might be as high as
.30 or even .50. The straight line, in other words, was a very poor fit to
the regression of the data. This should stress the importance of plotting
scatter diagrams more frequently than is ordinarily done, otherwise
important nonlinear regressions may be overlooked. No doubt many a zero
Pearson r reported in the literature conceals a significant nonlinear
relationship.

The Algebraic Sign of Eta.—Some writers regard it as a weakness of eta
that its algebraic sign is always positive. The algebraic sign of r is
meaningful in that it shows whether the general trend is upward or
downward. In defense of eta it may be said that it tells us the thing we are most
interested in, the goodness of fit or closeness of relationship between two
things. If the over-all trend is either upward or downward we can readily
perceive that by inspection of the scatter plot and we can attach whatever
sign is appropriate if we wish to do so. Some curved regressions, of a
U-shaped or an inverted U-shaped type, may yield a significant eta without
any general trend away from the horizontal. In this case no sign is
meaningful for eta.

Dependence of Eta upon the Number of Categories.—A more serious weak- 
ness of eta is that its size depends upon the number of columns (or rows). 
The minimum number of classes that would show any curvature at all is 
three, but three might give a much-smoothed and distorted view of the real 
relationship. With too small a number of classes, therefore, we run the 
chance of obtaining an estimate of correlation that is too small. On the 
other hand, as we increase the number of classes, we make the means 
of the classes less stable, and, as they fluctuate more, chance errors become 
more important in inflating eta. The limiting case would be classes so 
small that there was only one observation per class (assuming no duplicate 
measures on X), in which case the variance in the columns would be just as 
great as the over-all variance in Y and eta would equal 1.00. Methods for 
correcting eta for number of classes have been proposed, but none can be 
recommended. The best rule would be to keep the classes large enough 
so that means of classes are fairly stable and fall rather smoothly into line in 
the scatter plot and yet to have enough classes to bring out clearly enough 
the shape of the regression. The size of sample has some bearing on this. 
The larger the sample, the larger the number of classes that can be toler- 
ated. Very small samples would be unsuitable for the computation of eta 
at all. With large samples (100 and above) it is suggested that the num- 
ber of classes range between six and twelve. 
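
The dependence of eta on the number of classes can be seen in a small sketch. It assumes purely random, unrelated X and Y values (so the true relationship is zero) and a hypothetical helper, `correlation_ratio`, that sorts the cases on X and cuts them into equal-sized classes; neither comes from the text.

```python
import random

def correlation_ratio(x, y, n_classes):
    """Eta of Y on X after sorting on X and cutting into n_classes equal groups."""
    pairs = sorted(zip(x, y))
    n = len(pairs)
    grand_mean = sum(v for _, v in pairs) / n
    total_ss = sum((v - grand_mean) ** 2 for _, v in pairs)
    between_ss = 0.0
    for k in range(n_classes):
        group = [v for _, v in pairs[k * n // n_classes:(k + 1) * n // n_classes]]
        group_mean = sum(group) / len(group)
        between_ss += len(group) * (group_mean - grand_mean) ** 2
    return (between_ss / total_ss) ** 0.5

rng = random.Random(0)          # fixed seed, illustrative data only
n = 120
x = [rng.random() for _ in range(n)]
y = [rng.gauss(0, 1) for _ in range(n)]   # generated independently of x

# 120 classes means one observation per class, the limiting case in the text.
etas = {k: correlation_ratio(x, y, k) for k in (3, 6, 12, 24, 120)}
for k, eta in etas.items():
    print(f"{k:3d} classes: eta = {eta:.2f}")
```

Because each finer grouping here subdivides the coarser one, eta cannot decrease as classes are added, and it reaches 1.00 with one case per class even though X and Y are unrelated.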

The Use of Mathematical Functions.—Better than the correlation-ratio
approach, in research studies, would be an effort to establish the form of a
regression as some mathematical function and then test the goodness of fit
of data to that function by methods which we cannot go into here. There
are other texts that specifically treat this topic in some detail.¹

1 Deming, W. E. Statistical adjustment of data. New York: Wiley, 1946; Lewis, D.
Quantitative methods in psychology. Iowa City: The author, 1949.


THE BISERIAL COEFFICIENT OF CORRELATION


The biserial r is especially designed for the situation in which both of
the variables correlated are really continuously measurable but one of the
two is for some reason reduced to two categories. This reduction to two
categories may be a consequence of the only way in which the data can
be obtained, as, for example, when one variable is whether a student
passes or fails to pass a certain criterion of success. We can well
assume a continuum along which individuals differ with respect to the
ability required to pass this criterion. Those having a degree of ability
above a certain crucial point do pass it, and those having a degree of
ability below that crucial point fail to pass.

Let us assume that the criterion is graduation from pilot training. 
Although not all graduates are equal in achievement nor are all eliminees, 
all we know is whether each person belongs to one category or the other. 
It is as if our grouping were so coarse in this variable as to be confined 
to two class intervals rather than a dozen or so. If we are prepared to 
justify normality of distribution in this dichotomous variable, we have a 
formula by which a coefficient of correlation can be computed. 

[Fig. 13.2: normal curve over a base line running from below average ability, through 0, to above average ability, cut by an ordinate.]
Fig. 13.2.—A normal distribution of the cases along the scale of ability to pass the course
of training. The area to the right of the ordinate shown represents the 65 per cent who
graduated and the area to the left represents the 35 per cent who failed to graduate.

Computation of a Biserial r.—The principle upon which the formula
for a biserial r is based is that with zero correlation there would be no
difference between means, and the larger the difference between means
the larger the correlation. The general formula for biserial r is

\[ r_b = \frac{M_p - M_q}{\sigma_t} \times \frac{pq}{y} \qquad \text{(Biserial coefficient of correlation)} \tag{13.7} \]


where M_p = mean of X values for the higher group in the dichotomous
            variable, the one having more of the ability in which the
            sample is divided into two subgroups.
      M_q = mean of X values for the lower group.
      p = proportion of the cases in the higher group.
      q = proportion of the cases in the lower group.
      y = ordinate of the normal distribution curve with surface equal
            to 1.00, at the point of division between segments containing
            p and q proportions of the cases (see Fig. 13.2).
      σ_t = standard deviation of the total sample in the continuously
            measured variable, X.

TABLE 13.4.—DISTRIBUTIONS OF SCORES FOR TWO GROUPS OF STUDENTS—THOSE
PASSING AND THOSE FAILING—ALSO A COMBINED DISTRIBUTION

Scores             40-49  50-59  60-69  70-79  80-89  90-99  100-109  110-119  120-129  130-139     n    n/N
Passing students     --      1      3     10     27     30       26       21        7        5    130   .65 = p
Failing students      2      6      4     11     21     16        7        3       --       --     70   .35 = q
Combined              2      7      7     21     48     46       33       24        7        5    200  1.00

Table 13.4 presents typical data for computing a biserial correlation.
The passing group were distributed as shown; also, the failing group. The
proportions passing and failing are .65 and .35, respectively.¹ The y
ordinate (from Table C) is .3704. The distribution of the total group is
assumed to be as indicated in Fig. 13.2. The computation of the biserial
r proceeds as follows:

\[ r_b = \frac{98.27 - 83.64}{17.68} \times \frac{(.65)(.35)}{.3704} = \frac{3.328325}{6.548672} = .508 \]

1 It is good practice to compute p and q each to three significant digits.


Table G (Appendix B) is designed, in part, to supply several of the
constants needed in the computation of a biserial r either by formula
(13.7) or by formula (13.9), and the computation of its standard error.
For given values of p, Table G supplies the corresponding values of pq/y,
p/y, and √(pq)/y.


The Standard Error of r_b.—The standard error of a biserial r is estimated
by the formula

\[ \sigma_{r_b} = \frac{\dfrac{\sqrt{pq}}{y} - r_b^2}{\sqrt{N}} \qquad \text{(Standard error of a biserial r)} \tag{13.8} \]

where the symbols have already been defined above.
In this problem

\[ \sigma_{r_b} = \frac{\dfrac{.4770}{.3704} - .258064}{\sqrt{200}} = \frac{1.0297}{14.142} = .073 \]



This standard error may be interpreted as usual, and we find that the
obtained r_b is so large that it undoubtedly does not arise from an
uncorrelated population.
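
The arithmetic of formulas (13.7) and (13.8) can be retraced with a short sketch. It assumes the grouped frequencies of Table 13.4 with class midpoints 44.5, 54.5, ..., 134.5, and it takes the ordinate y from Python's `statistics.NormalDist` instead of Table C; the helper names are illustrative only.

```python
from math import sqrt
from statistics import NormalDist

# Class midpoints (intervals 40-49 ... 130-139) and frequencies from Table 13.4.
midpoints = [44.5, 54.5, 64.5, 74.5, 84.5, 94.5, 104.5, 114.5, 124.5, 134.5]
passing = [0, 1, 3, 10, 27, 30, 26, 21, 7, 5]
failing = [2, 6, 4, 11, 21, 16, 7, 3, 0, 0]

def grouped_mean(freqs):
    return sum(f * m for f, m in zip(freqs, midpoints)) / sum(freqs)

def grouped_sd(freqs):
    mean = grouped_mean(freqs)
    return sqrt(sum(f * (m - mean) ** 2 for f, m in zip(freqs, midpoints)) / sum(freqs))

total = [a + b for a, b in zip(passing, failing)]
n = sum(total)
p = sum(passing) / n
q = 1 - p
# Ordinate of the unit normal curve at the point dividing proportions p and q.
y = NormalDist().pdf(NormalDist().inv_cdf(p))

m_p, m_q, sd_t = grouped_mean(passing), grouped_mean(failing), grouped_sd(total)
r_b = (m_p - m_q) / sd_t * (p * q / y)             # formula (13.7)
se_r_b = (sqrt(p * q) / y - r_b ** 2) / sqrt(n)    # formula (13.8)
print(f"y = {y:.4f}, r_b = {r_b:.3f}, standard error = {se_r_b:.3f}")
```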

Alternative Formula for Biserial r.—In many situations, a more
convenient formula for the biserial r is¹

\[ r_b = \frac{M_p - M_t}{\sigma_t} \times \frac{p}{y} \qquad \text{(Alternative formula for a biserial r)} \tag{13.9} \]

where the only new symbol is M_t, the mean of the total sample. The
greater convenience of this formula over the other is that formula (13.9)
gives us one less distribution to deal with. A good type of work sheet
for solution by this formula is shown in Table 13.5. It is convenient to
use the same guessed mean for both the component distribution and for
the total distribution. By this procedure, the biserial r and its σ_r come
out the same as we have already seen.

TABLE 13.5.—SOLUTION OF MEANS AND STANDARD DEVIATION NECESSARY FOR THE
COMPUTATION OF A BISERIAL r

Scores      x'    f_p   f_p x'   f_t   f_t x'   f_t x'²
130-139     +4      5    +20       5    +20       80
120-129     +3      7    +21       7    +21       63
110-119     +2     21    +42      24    +48       96
100-109     +1     26    +26      33    +33       33
 90- 99      0     30      0      46      0        0
 80- 89     -1     27    -27      48    -48       48
 70- 79     -2     10    -20      21    -42       84
 60- 69     -3      3     -9       7    -21       63
 50- 59     -4      1     -4       7    -28      112
 40- 49     -5                     2    -10       50
Sums              130    +49     200    -27      629

c'_p = +.377      c'_t = -.135      σ_t = 10 √(629/200 - .135²)
c_p  = +3.77      c_t  = -1.35          = 10 √3.1268
M_p  = 98.27      M_t  = 93.15          = 17.68
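
The work-sheet method of Table 13.5 can be sketched as follows. The sketch assumes a guessed mean of 94.5 (the midpoint of the 90-99 interval), coded deviations x' in steps of one class interval, and a normal ordinate computed with `statistics.NormalDist` in place of Table C; variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

interval = 10
guessed_mean = 94.5                                # midpoint of the 90-99 interval
x_coded = [4, 3, 2, 1, 0, -1, -2, -3, -4, -5]      # x' for 130-139 down to 40-49
f_p = [5, 7, 21, 26, 30, 27, 10, 3, 1, 0]          # passing group
f_t = [5, 7, 24, 33, 46, 48, 21, 7, 7, 2]          # total group

n_p, n_t = sum(f_p), sum(f_t)
c_p = sum(f * x for f, x in zip(f_p, x_coded)) / n_p    # correction in interval units
c_t = sum(f * x for f, x in zip(f_t, x_coded)) / n_t
m_p = guessed_mean + interval * c_p
m_t = guessed_mean + interval * c_t
sum_sq = sum(f * x * x for f, x in zip(f_t, x_coded))   # sum of f_t x'^2
sd_t = interval * sqrt(sum_sq / n_t - c_t ** 2)

p = n_p / n_t
y = NormalDist().pdf(NormalDist().inv_cdf(p))
r_b = (m_p - m_t) / sd_t * (p / y)                 # formula (13.9)
print(f"M_p = {m_p:.2f}, M_t = {m_t:.2f}, sigma_t = {sd_t:.2f}, r_b = {r_b:.3f}")
```

Only the passing and total distributions enter the computation, which is the saving the alternative formula is meant to provide.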

1 Dunlap, J. W. Note on computation of biserial correlations in item evaluation.
Psychom., 1936, 1, 51-60.

An Evaluation of the Biserial r.—Since the biserial coefficient of
correlation is a product-moment r and is designed to be a good estimate of the
Pearson r, the same requirements as for the latter must be satisfied
(linear regression and homoscedasticity), plus the unique requirement that
the distribution of the values on the dichotomous variable, when
continuously measured, shall be normal. This requirement of normality applies
to the form of population distribution. Even if the sample distribution is
not normal, the population distribution may still be normal.

The use of the quantities p, g, and y in formulas (13.7) and (13.9) 
directly implies the normal distribution of the dichotomized variable. 
Departures from normality, if marked, will often lead to very erroneous 
estimates of correlation. With bimodal distributions, for example, it is
possible that r_b will prove to exceed 1.0. Bimodal and other nonnormal
distributions are most likely to occur in heterogeneous samples—for 
example, in variables on which there is a significant sex difference and 
both sexes are included in a sample. 
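
How a biserial r can exceed 1.0 may be seen in a deliberately extreme sketch. The data are hypothetical: X clusters tightly around two separate values, and the dichotomy is assumed to be a cut on a trait that is itself bimodal and perfectly related to X, so the normality assumption behind formula (13.7) plainly fails.

```python
from statistics import NormalDist, fmean, pstdev

# Hypothetical, sharply bimodal sample; the dichotomy separates the clusters.
low_group = [0, 1, 0, 1, 0, 1, 0, 1, 0, 1]
high_group = [9, 10, 9, 10, 9, 10, 9, 10, 9, 10]

everyone = low_group + high_group
p = len(high_group) / len(everyone)
q = 1 - p
y = NormalDist().pdf(NormalDist().inv_cdf(p))      # normal ordinate at the cut

# Formula (13.7), with sigma_t computed on N (population form).
r_b = (fmean(high_group) - fmean(low_group)) / pstdev(everyone) * (p * q / y)
print(f"r_b = {r_b:.2f}")
```

The coefficient comes out above 1.0 because the factor pq/y is scaled for a normal curve, while the two clusters lie much farther apart, relative to the total standard deviation, than the two halves of a normal distribution could.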

When to Dichotomize Distributions.—The biserial r is very useful, in fact
it is sometimes essential, and when properly used is a very good
substitute for the Pearson r. There are instances in which the Y variable has
been continuously measured but there are irregularities that preclude
computing a good estimate of the Pearson r. In such cases the biserial r
may be brought into service. One example of this would be a truncated
distribution; another would be when there are very few categories for the
Y variable and it is doubtful whether they are equidistant on a metric
scale; another would be in the case of a badly skewed distribution in Y
values owing to a defective measuring instrument. Before computing r_b
we would, of course, need to dichotomize each Y distribution. In this
we would have some choice, and it would be well to make the division
point as near the median as possible. The reason for this will be made
clear in the next paragraph. In all of these peculiar instances, however,
we are not relieved of the responsibility for defending the assumption of
the normal distribution of Y. It may seem contradictory to suggest that
when the Y distribution is skewed we resort to the biserial r, but note
that it is the sample distribution that is skewed and it is the population
distribution that must be assumed to be normal.

Biserial r Is Less Reliable than the Pearson r.—Whenever there is a real
choice of computing a Pearson r versus a biserial r, however, one should
favor the former, unless the sample is very large and computation
time is an important factor. The standard error for a biserial r is quite a
bit larger than that for a Pearson r derived from the same sample. If we
compare the two formulas for the standard errors, formulas (9.24) and
(13.8), we find that the only real difference is in the numerators. One
reads 1 − r² and the other reads √(pq)/y − r_b². If we examine the √(pq)/y
values in Table G, we find that even when this value is smallest (and that
is when p = q = .5), it is about 25 per cent larger than 1. When r = .00,
the standard error of r_b is therefore at least 25 per cent larger than that
for r for the same size of sample. As p approaches 1.0 or 0.0, the ratio
√(pq)/y becomes larger, until when p = .94, it is as large as 2. This is
why in the preceding paragraph it was recommended that dichotomies
have the division point as near the median as possible. It also suggests
that we need larger samples for the same dependability of r_b than for r
and that we should hesitate to compute r_b for very one-sided divisions of
cases unless the sample is extremely large. This is reasonable from
another point of view. Remember that prominent in the formula for r_b
is the difference between means. This difference is not very stable unless
each mean comes from a sample of sufficient size. Even if the sample
totaled 1,000 cases, if only 1 per cent of the cases were in one of the two
categories, its mean would be based upon only 10 cases. This is not
favorable to reliable estimates based upon such a mean.
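
The behavior of the ratio √(pq)/y can be checked without Table G. The sketch below assumes the normal ordinate is obtained from Python's `statistics.NormalDist`; the function name is illustrative.

```python
from math import sqrt
from statistics import NormalDist

def sqrt_pq_over_y(p):
    """sqrt(pq)/y, with y the unit-normal ordinate at the cut leaving proportion p on one side."""
    y = NormalDist().pdf(NormalDist().inv_cdf(p))
    return sqrt(p * (1 - p)) / y

ratios = {p: sqrt_pq_over_y(p) for p in (0.50, 0.60, 0.70, 0.80, 0.90, 0.94, 0.99)}
for p, ratio in ratios.items():
    print(f"p = {p:.2f}: sqrt(pq)/y = {ratio:.2f}")
```

When r is zero, this ratio is the factor by which the standard error of r_b exceeds that of a Pearson r from a sample of the same size.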

Other Serial Correlations.—Formulas have recently been developed by
Jaspen¹ for the correlation of a continuous variable with another variable
that has been artificially classified in three, four, or five categories.
Owing to the rareness of the need for such formulas space will not be taken
to present them here. If one has more than two categories, he can always
combine certain ones to make two and then compute r_b, provided, of
course, the necessary assumptions are satisfied.

1 Jaspen, N. Serial correlation. Psychom., 1946, 11, 23-30.


POINT-BISERIAL CORRELATION


When one of the two variables in a correlation problem is a genuine
dichotomy, or when it is doubtful that the dichotomous one stems from a
normal distribution, the appropriate type of coefficient to use is the
point-biserial r. Examples of genuine dichotomies are male versus female, being
a farmer versus not being a farmer, owning a home versus not owning one,
living versus dying, living in Boston versus not living in Boston, and so on.
Bimodal or other peculiar distributions, although not representing entirely
discrete categories, are sufficiently discontinuous to call for the
point-biserial rather than the biserial r. Examples of this type are color
blindness versus normal color vision, being alcoholic versus nonalcoholic,
and criminal versus noncriminal.

There are other variables, though not fundamentally dichotomous and
they may even be normally distributed, which we have to treat as if
they were genuine dichotomies in practical operations. An outstanding
example of this is a test item that is scored as either right or wrong.
No doubt those who answer the item correctly are not all equally capable
in the ability or abilities measured by the item. A total test score would
provide continuous gradations in ability levels. In testing practice,
however, the kind of item described is limited to separating individuals into
two groups, and only gross predictions can be made from responses to it.
Such a variable is a good example to explain the basic nature of the
point-biserial r. If we gave a "score" of +1 to each person with a correct
answer and a "score" of zero to each person with a wrong answer, in the
item variable we would have only two class intervals and we treat them as
if they were genuine intervals. A product-moment r could be computed
with Pearson's basic formula. The result would be a point-biserial r.
A special formula is provided, however, which does not resemble the basic
Pearson formula. It reads,

\[ r_{pbi} = \frac{M_p - M_q}{\sigma_t} \sqrt{pq} \qquad \text{(Point-biserial coefficient of correlation)} \tag{13.10} \]

where the symbols are defined just as they were in the formula for the
ordinary biserial r (formula 13.7).¹ The only difference between this
formula and the one for the ordinary biserial r is that the numerator
contains √(pq) rather than pq, and the constant y is missing from the
denominator. For the same set of data, then, the ordinary biserial r
would be √(pq)/y times as large as r_pbi. In this ratio lies a feature of
r_pbi to which we will return soon.

Let us apply formula (13.10) to some data on the relation of body
weight to sex membership. In a sample of 51 sixteen-year-old high-school
students, of whom 24 were male and 27 were female, the mean weights in
kilograms were 67.8 and 56.6, respectively. The proportion of males, p, is
accordingly 24/51 = .471 and q is .529. The standard deviation of the
combined distributions was 13.2. This is a rather small sample on which
to compute a point-biserial r, but it is favorably divided near the median
and, at any rate, will do as a simple illustration. Solving with formula
(13.10),

\[ r_{pbi} = \frac{67.8 - 56.6}{13.2} \sqrt{(.471)(.529)} = \frac{11.2}{13.2} \times .499 = .42 \]

The correlation between sex and body weight for sixteen-year-old
high-school students is estimated to be .42.
To the knowledge of the author no standard-error formula has been
developed for r_pbi. It is suggested that a test of the hypothesis of zero
correlation can be made by means of a t test of the difference M_p − M_q.
The decision about this hypothesis should be the same as that for the
hypothesis of a zero difference.

1 For a derivation of formula (13.10), also formula (13.11), see Appendix A.
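
The following sketch retraces the body-weight example and illustrates two points made above. The raw scores in the second half are hypothetical and serve only to check that formula (13.10) reproduces Pearson's r when the dichotomy is scored 1 and 0 and σ_t is computed on N. The expression for t is the standard identity linking a pooled-variance t to a product-moment r; it is not given in the text and is included here as an assumption about how the suggested t test might be carried out.

```python
from math import sqrt

def point_biserial(m_p, m_q, sd_t, p):
    """Formula (13.10): (M_p - M_q) / sigma_t * sqrt(pq)."""
    return (m_p - m_q) / sd_t * sqrt(p * (1 - p))

# Summary figures from the body-weight example.
n_male, n_female = 24, 27
n = n_male + n_female
r_pbi = point_biserial(67.8, 56.6, 13.2, n_male / n)

# Standard identity (an assumption here, not from the text): t with N - 2
# degrees of freedom for the difference between the two means.
t = r_pbi * sqrt((n - 2) / (1 - r_pbi ** 2))

def pearson(a, b):
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    sab = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    saa = sum((u - ma) ** 2 for u in a)
    sbb = sum((v - mb) ** 2 for v in b)
    return sab / sqrt(saa * sbb)

# Hypothetical raw data: a continuous score and a 1/0 dichotomy.
scores = [12.0, 15.0, 11.0, 18.0, 9.0, 7.0, 10.0, 6.0]
group = [1, 1, 1, 1, 0, 0, 0, 0]
hi = [s for s, g in zip(scores, group) if g == 1]
lo = [s for s, g in zip(scores, group) if g == 0]
mean_all = sum(scores) / len(scores)
sd_all = sqrt(sum((s - mean_all) ** 2 for s in scores) / len(scores))
via_formula = point_biserial(sum(hi) / len(hi), sum(lo) / len(lo),
                             sd_all, len(hi) / len(scores))
via_pearson = pearson(scores, group)
print(f"r_pbi = {r_pbi:.2f}, t = {t:.2f}")
```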

Alternative Methods of Computation for r_pbi.—As for the ordinary
biserial r, there is an alternative formula for computing r_pbi, which may be
more convenient in many situations. It reads

\[ r_{pbi} = \frac{M_p - M_t}{\sigma_t} \sqrt{\frac{p}{q}} \qquad \text{(Alternative formula for the point-biserial correlation coefficient)} \tag{13.11} \]

Formulas for r_pbi making unnecessary the computation of p and q are

\[ r_{pbi} = \frac{(M_p - M_q)\sqrt{N_p N_q}}{N \sigma_t} \qquad \text{(Alternative formulas for the point-biserial r)} \tag{13.12} \]

and

\[ r_{pbi} = \frac{M_p - M_t}{\sigma_t} \sqrt{\frac{N_p}{N_q}} \tag{13.13} \]

where N_p and N_q are the frequencies in the two categories.
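
A quick check that the alternative formulas agree with formula (13.10) is sketched below, using the body-weight figures from the earlier example. The forms of (13.12) and (13.13) used in the code are the algebraic rearrangements of (13.10) and (13.11) obtained by putting p = N_p/N and q = N_q/N; the total mean M_t is derived here from the two group means and is not quoted in the text.

```python
from math import isclose, sqrt

n_p, n_q = 24, 27                     # males, females
m_p, m_q, sd_t = 67.8, 56.6, 13.2     # group means and total standard deviation
n = n_p + n_q
p, q = n_p / n, n_q / n
m_t = (n_p * m_p + n_q * m_q) / n     # mean of the total sample (derived)

r_13_10 = (m_p - m_q) / sd_t * sqrt(p * q)
r_13_11 = (m_p - m_t) / sd_t * sqrt(p / q)
r_13_12 = (m_p - m_q) * sqrt(n_p * n_q) / (n * sd_t)
r_13_13 = (m_p - m_t) / sd_t * sqrt(n_p / n_q)
print(round(r_13_10, 3), round(r_13_11, 3), round(r_13_12, 3), round(r_13_13, 3))
```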

An Evaluation of the Point-biserial r.—Since the r_pbi coefficient is not
restricted to normal distributions in the dichotomous variable, it is much
more generally applicable than is r_b. Whenever there is doubt about
computing r_b, the point-biserial r will serve. For this reason, it should
probably be used more than it is. Being a product-moment r, it rests
upon the assumption of linear regression, though when the dichotomy is
genuine, the regression could be nothing other than linear. Although a
product-moment r, in value it is rarely comparable numerically with a
Pearson r, or even with an ordinary biserial r, even when computed from
the same data. This is its greatest weakness as a descriptive statistic.
Under special circumstances, to be described, it may be used as a basis
for making an estimate of the Pearson r.

Relation of r_pbi to r_b.—When properly applied, r_b gives coefficients that
are generally good approximations to Pearson r's that could be computed
from the same data had both variables been continuously measured.
Consequently, all the usual interpretations that are made of r (see Ch. 15)
can also be made of r_b.

If r_pbi were computed from data that actually justified the use of r_b,
however, the coefficient computed would be markedly smaller than r_b
obtained from the same data. Even if the one variable is actually
continuous but not normally distributed, in which case we might better
utilize r_pbi, the latter would give an underestimate of the amount of
correlation. As was pointed out before, r_b is √(pq)/y times as large as
r_pbi when they are computed from the same basic data. This ratio varies
from about 1.25 when p = .50 to about 3.73 when p (or q) equals .99 (see
Table G). Fig. 13.3 shows graphically the ratio of r_pbi to r_b for various
values of p. The ratio of r_pbi to r_b is, of course, the reciprocal of the ratio
of r_b to r_pbi; in other words, it is y/√(pq). The diagram is designed in this
manner to show maximum values of r_pbi that would arise from continuous,
normal distributions. In terms of formulas,

\[ r_b = r_{pbi} \frac{\sqrt{pq}}{y} \qquad \text{(Conversion of one biserial r into the other when normality of distribution exists)} \tag{13.14} \]

and

\[ r_{pbi} = r_b \frac{y}{\sqrt{pq}} \tag{13.14b} \]

[Fig. 13.3: curve of the ratio of point-biserial r to biserial r (vertical axis) plotted against p, the proportion in the larger category, from .50 to 1.00 (horizontal axis).]
Fig. 13.3.—Ratio of the point-biserial r to the biserial r when the difference between
means (M_p − M_q) and the standard deviation (σ) are constant and the proportion in the
larger category (p) varies.


It is recommended that when the dichotomous variable is normally
distributed without much doubt, r_b be computed and so interpreted. If
there is little doubt that the distribution is a genuine dichotomy, r_pbi
should be computed and so interpreted. For the doubtful situations,
r_pbi should be computed but interpreted in the light of Fig. 13.3. That is
to say, if the distribution in question is continuous but not normal, and if
r_pbi approaches the limit described by Fig. 13.3, we can say that the
genuine correlation approaches 1.00 more closely than the obtained
r_pbi does. If the obtained r_pbi should exceed the limit, for the size of p
involved, it probably means that the assumption of a genuine dichotomy
is the correct one. In other words, when there is a point distribution,
r_pbi can approach 1.00. Many distributions are in the doubtful class;
they are neither dichotomous nor continuous. At least, if they are
continuous, they may not be unimodal. It is to help take care of these
twilight instances that Fig. 13.3 was designed.

If it develops after we have computed r_pbi that the situation justifies
the use of r_b, we can convert the obtained r_pbi to the appropriate r_b by
means of formula (13.14). If we have computed r_b when it later develops
that we should have used r_pbi, formula (13.14b) will provide the proper
transformation.


TETRACHORIC CORRELATION

A tetrachoric r is computed from data in which both X and Y have
been reduced artificially to two categories. Under the appropriate
conditions it gives a coefficient that is numerically equivalent to a Pearson r
and may be regarded as an approximation to it. It is sometimes the only
way of estimating the correlation between two variables because the data
could not be obtained in graded quantities. It is sometimes a quick and
convenient method of estimating r from data that are in the form of
continuous measurements, when time is an important consideration and the
sample is large.
Assumptions Underlying the Tetrachoric r.—The tetrachoric r requires
that both X and Y be continuously variable, normally distributed, and
linearly related. A problem in which the tetrachoric r may be computed
is illustrated in Table 13.6, if we are willing to make the necessary
assumptions. These data represent the numbers of students responding "Yes"
and "No" to two questions in a personality questionnaire. Question I
was, "Do you enjoy getting acquainted with most people?" and
Question II was, "Do you prefer to work with others rather than alone?"
Out of 930 replies to both questions, we have the numbers who responded
similarly (cells a and d in Table 13.6) and the number who responded
differently to the two questions (cells b and c). It is obvious that in the
case of a perfect positive correlation, all the cases would fall in cells a
and d. In a perfect negative correlation, they would fall in cells b and c.
In a zero correlation, the frequencies would be proportionately distributed
in the four cells.¹

TABLE 13.6.—FOURFOLD TABLE FROM WHICH A TETRACHORIC COEFFICIENT OF
CORRELATION IS COMPUTED

                          Question I                          Division point
Question II      Yes         No          Total   Proportion   Ordinate    Deviate
Yes              374 (a)     167 (b)      541    .582 (p)     .3905 (y)   .2070 (z)
No               186 (c)     203 (d)      389    .418 (q)
Total            560         370          930    1.000
Proportion       .602 (p')   .398 (q')   1.000
Ordinate         .3858 (y')
Deviate          .2585 (z')

The assumption of continuity and normality of distribution can be
defended as follows: It is unlikely that all who respond "Yes" to either
question do so with equal degree of affirmation. It is similarly unlikely
that those who respond "No" do so with equal degree of negation. It is
most likely that either question represents a continuum of behavior
extending from strong affirmation at the one extreme to strong negation at the
other. Continuity is thus the probable state of affairs, not a real
dichotomy. If a continuum is granted, the general law of unimodal distribution
approaching normality in psychological traits may be cited in defense of
the other requirement. By making the necessary assumptions, at any
rate, many things can be done with such data that would otherwise be
impossible. As in most statistical operations where true form of
distribution is unknown, we can here remember that we have taken the chance of
faulty assumptions and interpret results with the requisite reservation.

1 It will be noted that the categories for X are in an unusual order (positive or "good"
end toward the left), which makes the regression "line" slope downward to the right
for a positive correlation. For some reason, tradition has kept to this arrangement.
Other 2 X 2 tables reverse this order, in keeping with the usual scatter diagram. Then
the letters a and b, also c and d, are reversed. Letters a and d always stand for
like-signed cases in this volume.

The Equation for the Tetrachoric r.—The complete equation for the
tetrachoric r is indeed a long and complicated one, involving a series
including many powers of r. The first few terms included, it reads

\[ r_t + r_t^2 \frac{z z'}{2} + r_t^3 \frac{(z^2 - 1)(z'^2 - 1)}{6} + \cdots = \frac{ad - bc}{y y' N^2} \tag{13.15} \]

The symbols will be explained with reference to Table 13.6. The letters
a, b, c, and d refer to the frequencies in the four cells of the fourfold table.
r_t is given the subscript to indicate that it is a tetrachoric r. Numerically,
it is equivalent to a Pearson r.

In Table 13.6, it will be noted that the distribution of responses to
Question I is given in terms of proportions p' and q'. The distribution of
all responses to Question II is similarly given in terms of p and q. These
proportions are required for finding the values for the y's and z's in formula
(13.15). The symbols z and z' stand for the standard measurements on
the base line of the normal distribution curve at the points of division
of cases in the two distributions.

From Table C, we find that z is .2070 and z' is .2585. The symbols
y and y' stand for the ordinates in the normal curve at the points of
division. From Table C, we find that they are .3905 and .3858,
respectively. N is, of course, 930. We now have all the values except r_t, for
which we must solve the equation.

An Approximate Solution for a Tetrachoric r.—Let us ignore all terms
involving higher powers of r_t than the second. We can then reduce
formula (13.15) to a quadratic equation that is readily soluble but that
will give only an approximation to r_t. The terms ignored are rather small,
however, and so can be disregarded. With substitutions, the equation
becomes

\[ r_t + \frac{(.2070)(.2585)}{2} r_t^2 = \frac{(374)(203) - (167)(186)}{(.3905)(.3858)(930^2)} \]

which reduces to

\[ r_t + .026755 r_t^2 = .344279 \]

It is well in this solution to carry at least six decimal places in order to
assure a sufficient number of significant digits later.

We have now arrived at a quadratic equation, which, with rearrangement
of terms, becomes

\[ .026755 r_t^2 + r_t - .344279 = 0 \]

This is in a form to which we can readily apply a standard algebraic
solution. If the standard quadratic equation is written

\[ a r_t^2 + b r_t + c = 0 \]

we can solve for r_t by using the following formula:

\[ r_t = \frac{-b \pm \sqrt{b^2 - 4ac}}{2a} \tag{13.16} \]

In our equation for r_t above, a is .026755, b is 1.0, and c is −.344279.
Substituting these in formula (13.16) we have

\[ r_t = \frac{-1.0 \pm \sqrt{1 - 4(.026755)(-.344279)}}{2(.026755)} = \frac{-1.0 \pm 1.01825}{.05351} \]

Here we have a choice between the positive or the negative square root.
If we choose the negative one, the numerator becomes −2.01825, which
would give us an r_t equal to about −40. This is obviously absurd, since
no law-abiding r can go below −1. Taking the positive root, we have

\[ r_t = \frac{.01825}{.05351} = .341, \text{ or } +.34 \]

We can now say that, if our assumptions about the two distributions
are granted, the correlation between an expressed enjoyment of getting
acquainted with people and an expressed preference for working with
others is +.34. For greater refinement in the solution, we could now
treat .341 as a trial value for r_t in equation (13.15) and see how much
discrepancy is involved when the term having r³ in it is included in the
calculations. We could make any change in r_t that seemed necessary
for a better satisfaction of the equation and by successive trial-and-error
maneuvers arrive at a more exact choice of r_t. Probably most data are
not of sufficient number or precision to justify the extra labor involved
in this. The discrepancy involved when all powers higher than two are
ignored in equation (13.15) is probably much smaller than the standard
error of r_t unless r_t is fairly large.
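
The quadratic solution and the suggested refinement can be sketched as follows. Two things are assumptions rather than the text's procedure: the deviates and ordinates are computed with `statistics.NormalDist` from the exact proportions (so they differ slightly from the rounded Table C values), and the trial-and-error refinement is replaced by Newton's method on the series cut off after the cubic term, using the standard series coefficient (z² − 1)(z'² − 1)/6 for r³.

```python
from math import sqrt
from statistics import NormalDist

a, b, c, d = 374, 167, 186, 203        # cells of Table 13.6
n = a + b + c + d
p = (a + b) / n                        # proportion "Yes" on Question II
p_prime = (a + c) / n                  # proportion "Yes" on Question I

std = NormalDist()
z, z_prime = std.inv_cdf(p), std.inv_cdf(p_prime)
y, y_prime = std.pdf(z), std.pdf(z_prime)

rhs = (a * d - b * c) / (y * y_prime * n ** 2)
k2 = z * z_prime / 2                          # coefficient of r^2
k3 = (z ** 2 - 1) * (z_prime ** 2 - 1) / 6    # coefficient of r^3 (assumed standard term)

# Quadratic approximation: k2*r^2 + r - rhs = 0, positive root as in formula (13.16).
r_quadratic = (-1 + sqrt(1 + 4 * k2 * rhs)) / (2 * k2)

# Refinement: Newton's method on the series truncated after the cubic term.
r = r_quadratic
for _ in range(20):
    f = r + k2 * r ** 2 + k3 * r ** 3 - rhs
    r -= f / (1 + 2 * k2 * r + 3 * k3 * r ** 2)
r_cubic = r
print(f"quadratic: {r_quadratic:.3f}, with cubic term: {r_cubic:.3f}")
```

For these data the cubic term lowers the estimate only slightly, by far less than the standard error of r_t.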

The Standard Error of a Tetrachoric r.—The tetrachoric r is less
reliable than the Pearson r, being as much as 50 per cent more variable.
r_t is most reliable (1) when N is large, as is true of all statistics, (2) when r
is large, as is true of other r's, but also (3) when the divisions into two
categories are close to the medians. The complete formula for estimating
σ_rt is entirely too long to be practical; so it will not be given here. But
when r_t = .0, the formula is much simpler and reads

\[ \sigma_{r_t} = \frac{\sqrt{p q p' q'}}{y y' \sqrt{N}} \qquad \text{(Standard error of a zero tetrachoric r)} \tag{13.17} \]

where the symbols mean the same as in formula (13.15) or in Table 13.6.¹
This is the most useful of the formulas for σ_rt, at any rate, because it
permits testing the null hypothesis. If the correlation in the population is
zero, samples of the size we obtained would yield r's with a standard
error as given by this equation. For the 930 cases in our problem,

\[ \sigma_{r_t} = \frac{\sqrt{(.582)(.602)(.418)(.398)}}{(.3905)(.3858)\sqrt{930}} = .053 \]


Since the obtained r_t is .34, being more than 2.6 times this standard
error, we can be quite positive that the two qualities represented by the
two questions are really correlated in the population. Had this been a
Pearson r, σ_r (for r = .0) would have been .033. This gives a rough idea
of the relative degree of sampling fluctuation of the two kinds of r and
also bears out the statement made earlier that the tetrachoric r is about
50 per cent more variable. This fact should impress one with the
importance of using a larger sample when r_t is to be the index of correlation.
Roughly, to attain the same degree of reliability in a tetrachoric r, one
needs more than twice the number of cases as in the use of the Pearson r.
For very dependable results, it is recommended that N be at least 200
and preferably 300 when r_t is to be computed. In smaller samples than
these, even less than N = 100, a tetrachoric r can be used to test the null
hypothesis, but it cannot be depended upon to give very accurate
estimates of the size of correlation unless r is very large.
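
Formula (13.17) and the comparison with a Pearson r can be retraced in a few lines. The sketch assumes the normal ordinates are computed with `statistics.NormalDist` from the rounded proportions .582 and .602, and it takes 1/√N as the standard error of a zero Pearson r; the function name is illustrative.

```python
from math import sqrt
from statistics import NormalDist

def se_zero_tetrachoric(p, p_prime, n):
    """Formula (13.17): standard error of r_t when the population correlation is zero."""
    std = NormalDist()
    y = std.pdf(std.inv_cdf(p))
    y_prime = std.pdf(std.inv_cdf(p_prime))
    return sqrt(p * (1 - p) * p_prime * (1 - p_prime)) / (y * y_prime * sqrt(n))

n = 930
se_t = se_zero_tetrachoric(0.582, 0.602, n)
se_pearson = 1 / sqrt(n)                       # zero Pearson r, same sample size
ratio_here = se_t / se_pearson
ratio_at_medians = se_zero_tetrachoric(0.5, 0.5, n) / se_pearson
print(f"{se_t:.3f} vs {se_pearson:.3f}; ratio {ratio_here:.2f}, "
      f"at median splits {ratio_at_medians:.2f}")
```

Even with both divisions exactly at the medians the ratio stays above 1.5, which is consistent with the statement that a tetrachoric r needs more than twice as many cases for the same reliability.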

1 For aids in estimating σ_rt, see Guilford, J. P., and Lyons, T. C. On determining
the reliability and significance of a tetrachoric coefficient of correlation, Psychom., 1942,
7, 243-249; also Hayes, S. P. Tables of the standard error of tetrachoric correlation
coefficient, Psychom., 1943, 8, 193-203.

Other Procedures for Estimating r_t.—A tetrachoric r can be estimated
well enough for practical purposes by a number of short-cut procedures,
two of which will now be described.

The Cosine-pi Formula.—One approximation formula for r_t is known as
the cosine-pi formula. In mathematical form,

\[ r_t = \cos\left( \frac{\pi \sqrt{bc}}{\sqrt{ad} + \sqrt{bc}} \right) \]

Since for computing purposes here π can be taken as 180 degrees, the
practical form of the equation is

\[ r_t = \cos\left( \frac{180^\circ \sqrt{bc}}{\sqrt{ad} + \sqrt{bc}} \right) \qquad \text{(Cosine-pi approximation formula for estimating a tetrachoric r)} \tag{13.18} \]

where a, b, c, and d are the frequencies as defined in Table 13.6. It is
well to remember that b and c represent the unlike-signed cases and a and
d the like-signed cases. When numbers are substituted, the expression
within the parentheses reduces to a single number which is an angle in
terms of degrees of arc. The cosine of this angle is the estimate of r_t.
The angle will vary between zero, when either b or c is zero, or both, to
180°, when either a or d is zero, or both. In the first case, when the angle
is zero, the correlation is +1.0 and in the second case, when the angle is
180°, r_t is −1.0. When the product bc equals ad, the angle is 90°, the
cosine of which is zero, and r_t = .0.
Let us apply the cosine-pi formula to the data of Table 13.6.

\[ r_t = \cos\left[ \frac{180^\circ \sqrt{(167)(186)}}{\sqrt{(374)(203)} + \sqrt{(167)(186)}} \right] = \cos\left[ \frac{180^\circ (176.3)}{275.5 + 176.3} \right] = \cos 70.24^\circ = .338 \]

The cosine of an angle of 70.24 degrees (as found by interpolating in
Table J, Appendix B) is .338. This estimate of r_t for these data checks
very closely with that reported earlier (.341).

In this method, if the angle should prove to be between 90° and 180° 
the correlation is negative. This can be anticipated by noting that the 
product bc is greater than ad. Angles over 90° are not listed in Table J. 
For angles between 90° and 180°, find the cosine of 180° minus the obtained 
angle. 
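
The substitution above can be checked mechanically. What follows is a minimal Python sketch of the practical form of the cosine-pi formula. The function name `cosine_pi_rt` is ours, not the text's, and the assignment a = 374, d = 203, b = 167, c = 186 is an assumption read off the worked substitution (only the products ad and bc enter the formula). Carried at full precision, the angle comes to about 70.2 degrees and the estimate to about .34.

```python
import math

def cosine_pi_rt(a, b, c, d):
    """Cosine-pi estimate of a tetrachoric r from fourfold frequencies.

    a and d are the like-signed cells; b and c are the unlike-signed cells.
    Returns the estimate and the angle in degrees.
    """
    root_ad = math.sqrt(a * d)
    root_bc = math.sqrt(b * c)
    if root_ad + root_bc == 0:
        raise ValueError("ad and bc cannot both be zero")
    angle_deg = 180.0 * root_bc / (root_ad + root_bc)
    # Angles above 90 degrees give a negative cosine automatically,
    # so no lookup of 180 degrees minus the angle is needed here.
    return math.cos(math.radians(angle_deg)), angle_deg

# Cell assignment assumed from the worked substitution for Table 13.6.
r_t, angle = cosine_pi_rt(a=374, b=167, c=186, d=203)
print(f"angle = {angle:.2f} degrees, r_t = {r_t:.3f}")
```

A zero in cell a or d drives the angle to 180 degrees and the estimate to −1.0, whatever the remaining cells contain.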

Graphic Estimates of Tetrachoric r.—When a large number of tetrachoric
r's must be computed, considerable saving of labor is provided by the
Thurstone computing diagrams.¹ These are highly recommended since
they yield two-place accuracy with little effort after the fourfold table
is completely reduced to the status of proportions throughout, as in
Table 13.7. From the computing diagrams, r_t for the data in Table 13.7
is estimated to be +.79. The correlation of the two questions of Table
13.6 is estimated as +.34, which checks with previous estimates. Another
graphic procedure was recently published by Hayes.²


1 Chesire, L., Saffir, M., and Thurstone, L. L. Computing diagrams for the tetra-
choric correlation coefficient. Chicago: University of Chicago Bookstore, 1938.
2 Hayes, S. P. Diagrams for computing tetrachoric correlation coefficients from
percentage differences, Psychom., 1946, 11, 163-172.



TABLE 13.7.—THE REDUCTION OF A SCATTER DIAGRAM TO A FOURFOLD TABLE PREPARA-
TORY TO THE COMPUTATION OF A TETRACHORIC COEFFICIENT OF CORRELATION *

                          Mark in Schoolwork
IQ                  F      D      C      B      A     Total
120 and above      ..     ..     12     32     40        84
110-119            ..      4     23     66     23       116
100-109             1     10     67     77     15       170
90-99               1     22    133     40      3       199
80-89               8     71    125     21      2       227
70-79              36     92     24      1     ..       153
Below 70           27     36      4     ..     ..        67
Total              73    235    388    237     83     1,016

                  In terms of frequencies      In terms of proportions
IQ                C and    A and    Total      C and    A and    Total
                  below    B                   below    B
90 and above       273      296      569       .269     .291     .560
Below 90           423       24      447       .416     .024     .440
Total              696      320    1,016       .685     .315    1.000

* Adapted from Cobb, M. V. The limits set to educational achievement by limited intelligence.
J. educ. Psychol., 13, 1922, p. 449. By permission of the publisher.

Reducing Distributions in Class Intervals to Fourfold Tables.—Data
need not be obtained in two categories each way in order to apply the
tetrachoric solution for r. Any scatter diagram, in fact, can be reduced
to two groups each way by making arbitrary divisions. Such a division
should be made as nearly as possible at or near the median in each dis-
tribution. Table 13.7 shows a scatter diagram in which reduction to a
fourfold table would be highly desirable. A Pearson r computed with so
few class intervals each way would be highly influenced by errors of
grouping. The very large number of cases renders the reduction in reli-
ability of r by computing r_t of small importance. The divisions suggested
in Table 13.7 come between the B's and C's for distribution of school
marks and at an IQ of 90 for intelligence rating. The revised correlation
distribution is seen in Table 13.7.
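
The reduction just described amounts to summing the cells on each side of the two dividing lines. The sketch below, in Python, does this for the frequencies of Table 13.7. The helper name `reduce_to_fourfold` is hypothetical, and the 70-79 row, which is partly illegible in this copy, is entered as implied by the row and column totals; both are our assumptions.

```python
# Rows: IQ classes from highest to lowest. Columns: marks F, D, C, B, A.
scatter = [
    [0, 0, 12, 32, 40],    # 120 and above
    [0, 4, 23, 66, 23],    # 110-119
    [1, 10, 67, 77, 15],   # 100-109
    [1, 22, 133, 40, 3],   # 90-99
    [8, 71, 125, 21, 2],   # 80-89
    [36, 92, 24, 1, 0],    # 70-79 (cells inferred from the marginal totals)
    [27, 36, 4, 0, 0],     # lowest IQ class
]

def reduce_to_fourfold(table, row_cut, col_cut):
    """Collapse a scatter diagram into a 2 x 2 table of frequencies.

    Rows with index below row_cut form the first row group; columns with
    index below col_cut form the first column group.
    """
    cells = [[0, 0], [0, 0]]
    for i, row in enumerate(table):
        for j, frequency in enumerate(row):
            cells[0 if i < row_cut else 1][0 if j < col_cut else 1] += frequency
    return cells

# Divide at an IQ of 90 (after the fourth row) and between C and B.
fourfold = reduce_to_fourfold(scatter, row_cut=4, col_cut=3)
n = sum(sum(row) for row in fourfold)
proportions = [[round(f / n, 3) for f in row] for row in fourfold]
print(fourfold, n, proportions)
```

The four sums and their proportions agree with the lower half of Table 13.7.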

Some Applications of r_t to Be Avoided.—Many of the limitations of the
tetrachoric r have already been pointed out. There are others which
should not go unnoticed. It is well to avoid estimating r_t when the split
in either X or Y is very one-sided—for example, a 95-5, or even a 90-10,
division of the cases. The standard error is much larger in such situations
as these.



Especially to be avoided is an attempt to estimate r_t when there is a
zero in only one cell. Table 13.8, A and B, illustrates two such examples.


TABLE 13.8.—ILLUSTRATIONS OF SOME UNUSUAL FOURFOLD CONTINGENCY TABLES IN
WHICH COMPUTATION OF A TETRACHORIC r IS QUESTIONABLE

          A                       B                       C
    0    200 | 200         110     80 | 190          15     85 | 100
  110     90 | 200           0    150 | 150         105     95 | 200
  110    290 | 400         110    230 | 340         120    180 | 300

If r_t were computed for problem A it would equal −1.0 (the zero is in
cell a); if computed for problem B, r_t would equal +1.0. This is in
spite of the fact that about one-fourth of the cases belie the perfect corre-
lations apparent by computation (90 cases out of 400 in A are out of line
with the finding and 80 cases in B). These examples are perhaps some-
what rare, but zero frequencies are certainly not unheard of. Even
scatters like that in C would probably give a false estimate of correlation.
There is no zero, but there is an exceptionally small frequency (15) among
much larger ones. In all three fourfold tables the distributions are such
as to suggest nonlinear regressions if these broad categories were broken
down into finer groupings. If the assumption of linearity is not satisfied,
r_t may well give a biased estimate of correlation. Such distributions as
those in Table 13.8 are not proof of nonlinear regression but they strongly
suggest it. In general, a distribution in such a table should appear to be
rather symmetrical around one diagonal axis or the other, depending upon
whether the correlation is negative or positive. This holds true if the
proportion p is somewhat near the proportion p', but if they differ too
much, asymmetry cannot be taken necessarily to mean curved regression.


THE PHI COEFFICIENT


When the two distributions correlated are really dichotomous, when
the two classes are separated by a real gap between them and previous
correlational methods do not apply, we may resort to the phi coefficient.¹
This was designed for so-called point distributions, which implies that the
two classes have two point values or merely represent some unmeasurable
attribute. Such a case would be illustrated by eye color, sex, "living
versus dead," and the like. The method can be applied, however, to data
that are measurable on a continuous variable if we make certain allow-
ances for that fact. It is a close relative of chi square, which is applicable
to a wide variety of situations.

1 Also known as the Yule φ or sometimes as the Yule-Boas φ. See Yule, G. U. On
the methods of measuring the association between two attributes. J. Roy. Stat. Soc.,
1912, 75, 576-642.

The Computation of Phi.—To illustrate the use of phi (φ), we shall use
again some data that were previously employed with chi square (see 
Table 11.3). They are repeated here as we need them, in proportion 
form, in Table 13.9. 


TABLE 13.9.—A TABLE TO ILLUSTRATE THE CORRELATION OF ATTRIBUTES

               Normal        Feebleminded     Both
Married        .269 (α)      .204 (β)         .473 (p)
Unmarried      .231 (γ)      .296 (δ)         .527 (q)
Both           .500 (p')     .500 (q')        1.000


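$$\phi = \frac{\alpha\delta - \beta\gamma}{\sqrt{pqp'q'}} \tag{13.19}$$
(The phi correlation coefficient)

where the symbols correspond to the labeled cells in Table 13.9.¹ The
solution of φ for this table is

$$
\begin{aligned}
\phi &= \frac{(.269)(.296) - (.204)(.231)}{\sqrt{(.473)(.527)(.5)(.5)}} \\
     &= \frac{.0325}{.2496} \\
     &= .1302, \text{ or } .13
\end{aligned}
$$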


The Relation of Phi to Chi Square.—Phi is related to chi square from a
2 × 2 table by the very simple equation

$$\chi^2 = N\phi^2 \tag{13.20}$$
(Chi square as a function of phi)

and phi is derived from chi square by the equation

$$\phi = \sqrt{\frac{\chi^2}{N}} \tag{13.21}$$
(Phi as a function of chi square)

By formula (13.20)

$$\chi^2 = (412)(.016952) = 6.98$$


This checks with the solution of chi square by other methods (see Ch. 11). 
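
The two computations above, phi from the proportions of Table 13.9 and chi square from phi, can be sketched in a few lines of Python. The function name `phi_from_proportions` is our own label; N = 412 is taken from the substitution in formula (13.20).

```python
import math

def phi_from_proportions(alpha, beta, gamma, delta):
    """Phi from the four cell proportions of a fourfold table (formula 13.19)."""
    p, q = alpha + beta, gamma + delta              # row totals
    p_prime, q_prime = alpha + gamma, beta + delta  # column totals
    return (alpha * delta - beta * gamma) / math.sqrt(p * q * p_prime * q_prime)

# Cell proportions from Table 13.9.
phi = phi_from_proportions(alpha=.269, beta=.204, gamma=.231, delta=.296)

# Formula (13.20): chi square as a function of phi, with N = 412.
chi_square = 412 * phi ** 2
print(f"phi = {phi:.4f}, chi square = {chi_square:.2f}")
```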


1 For a derivation of formula (13.19), see Appendix A.




Since phi can be derived directly from chi square, when applied to a 
2 X 2 table, any of the formulas for chi square given in Ch. 11 will apply 
to its computation. Formula (11.7), especially, which is very similar to 
formula (13.19) above, is probably more convenient. Applied directly 
to the computing of phi, it becomes 


$$\phi = \frac{ad - bc}{\sqrt{(a + b)(a + c)(b + d)(c + d)}} \tag{13.22}$$
(Phi computed from frequencies)
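
As an illustration of formula (13.22), the following Python sketch computes phi directly from cell frequencies. Since the frequencies behind Table 13.9 are not repeated in this section, the example borrows the frequencies of Table 13.8, C (15, 85, 105, 95); that choice of data and the function name are ours.

```python
import math

def phi_from_frequencies(a, b, c, d):
    """Phi from the four cell frequencies of a fourfold table (formula 13.22)."""
    denominator = math.sqrt((a + b) * (a + c) * (b + d) * (c + d))
    if denominator == 0:
        raise ValueError("a marginal total is zero; phi is undefined")
    return (a * d - b * c) / denominator

# Table 13.8, C: first row 15 and 85, second row 105 and 95.
phi_c = phi_from_frequencies(a=15, b=85, c=105, d=95)
print(f"phi = {phi_c:.3f}")
```

Because bc exceeds ad in this table, the coefficient is negative.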


The Special Case of Phi When One Distribution Is Evenly Divided.—
When one of the distributions, let us say the one for which we use p' and q'
as total proportions, is evenly divided so that p' = q' = .50, the solution
of φ is considerably simplified. The formula reads

$$\phi = \frac{\alpha - \beta}{\sqrt{pq}} \tag{13.23}$$
(Phi from evenly divided proportions)

Applied to the data on marital status

$$
\begin{aligned}
\phi &= \frac{.269 - .204}{\sqrt{(.473)(.527)}} \\
     &= \frac{.065}{.4993} \\
     &= .130
\end{aligned}
$$


This particular case is useful in many an experimental situation where 
two separated groups are selected with equal numbers of cases. There is 
some question here, of course, as to how well the samples chosen repre- 
sent the larger population from which they were obtained, and so inter- 
pretations should be stated with this knowledge in mind. 

The Reliability and Significance of Phi.—The formula for the esti-
mation of the standard error of phi involves such laborious computations
that it is impractical for general use. It will not be given here. A test
of the null hypothesis, fortunately, can be made through phi's relation-
ship to chi square. If χ² is significant in a fourfold table, the correspond-
ing φ is significant. The procedure, then, is to derive the corresponding
χ² from the obtained φ by means of formula (13.20), then examine Table E
to find whether for one degree of freedom the required level of sig-
nificance is met. In the marital problem, we find that a chi square of
6.98 is significant beyond the 1 per cent level; therefore the φ
of .13 is likewise significant.
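
Where Table E is not at hand, the probability for one degree of freedom can be computed directly, since a chi square with one degree of freedom is the square of a standard normal deviate. The Python sketch below is an alternative to the table lookup described above, not the text's own procedure; the function name is hypothetical.

```python
import math

def chi_square_p_value_1df(chi_square):
    """Upper-tail probability of chi square with one degree of freedom.

    Uses P(chi square > x) = P(|z| > sqrt(x)) = erfc(sqrt(x / 2)).
    """
    return math.erfc(math.sqrt(chi_square / 2.0))

n, phi = 412, 0.1302           # values from the marital-status problem
chi_square = n * phi ** 2      # formula (13.20)
p_value = chi_square_p_value_1df(chi_square)
print(f"chi square = {chi_square:.2f}, P = {p_value:.4f}")
```

A probability below .01 corresponds to significance beyond the 1 per cent level.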


An Evaluation of the Phi Coefficient

Phi is a product-moment coefficient of correlation. Its formula is a
variation of Pearson's fundamental equation, r = Σxy/Nσ_xσ_y. The
similarity may be seen to some
degree, at least, if we break the denominator of formula (13.19) into two
components, √(pq) and √(p'q'). These are the standard deviations of the
two point distributions, in Y and X. If we give numerical values of +1
and 0 to the two categories in X and in Y, and if we carry through the
computation of a Pearson r in a scatter diagram of four cells, we arrive
at a correlation coefficient equal to φ.
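
The equivalence claimed in this paragraph can be verified numerically. The Python sketch below assigns the values 1 and 0 to the two categories of each variable, expands a fourfold table into paired scores, and compares the ordinary product-moment r with phi from formula (13.22). The data are those of Table 13.8, C; the coding of cells to score pairs is our assumption.

```python
import math

def pearson_r(xs, ys):
    """Product-moment r computed from paired scores."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sum_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sum_xx = sum((x - mean_x) ** 2 for x in xs)
    sum_yy = sum((y - mean_y) ** 2 for y in ys)
    return sum_xy / math.sqrt(sum_xx * sum_yy)

def phi_from_frequencies(a, b, c, d):
    """Phi from cell frequencies, formula (13.22)."""
    return (a * d - b * c) / math.sqrt((a + b) * (a + c) * (b + d) * (c + d))

# Table 13.8, C. Assumed coding of (X, Y) scores for each cell:
# a -> (1, 1), b -> (0, 1), c -> (1, 0), d -> (0, 0).
a, b, c, d = 15, 85, 105, 95
pairs = [(1, 1)] * a + [(0, 1)] * b + [(1, 0)] * c + [(0, 0)] * d
xs = [x for x, _ in pairs]
ys = [y for _, y in pairs]

r = pearson_r(xs, ys)
phi = phi_from_frequencies(a, b, c, d)
print(f"Pearson r = {r:.4f}, phi = {phi:.4f}")
```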

Limitations to the Size of Phi.—While φ can vary from −1.0 to +1.0,
only under certain conditions can φ be as large as either of these extremes,
even though a tetrachoric r, if computed for the same data, would yield
an r_t equal to 1. This is probably its greatest weakness, but in certain
practical situations (see Ch. 17) it is a realistic feature. The reason is
that the reduction of frequencies to a 2 × 2 table places serious restric-
tions upon φ that do not affect r_t. The general principle is that φ can be
as great as 1 only when the two variables are divided so that p = p' or
p = q' (and, of course, q = q' or q = p'). To illustrate these restrictions,
we may take a few special cases in which p = .5 but p' is allowed to vary.
Such instances are pictured in Table 13.10. With an even division of the
cases in the two categories in Y, only with an even division also in X is it
possible to have a perfect correlation, as shown in contingency tables A
and B. With a division of 75-25 in variable X, the maximum φ would
be .58 (contingency table C) and with a 90-10 division, the maximum φ
would be .33. In contingency table E the division in X is again 75-25
but there is departure from maximal relationship. The obtained φ of .35
may be interpreted for size in the light of the maximal φ possible with
the particular combination of marginal totals, if we are interested in the
underlying strength of relationship between X and Y. If we are inter-
ested in making predictions from categories to other categories, however,
the obtained φ is a more realistic figure. The problems of prediction
come in the chapters to follow.

TABLE 13.10.—SOME FOURFOLD CONTINGENCY TABLES ILLUSTRATING THE DEPENDENCE
OF THE SIZE OF A PHI COEFFICIENT UPON THE MARGINAL TOTALS

[Contingency tables A to E. Of the cell entries only two rows survive in
this copy, 40, 10, 50 and 90, 10, 100; the coefficients shown are φ = .58
for C, φ = .33 for D, and φ = .35 for E.]



Determination of a Maximal Phi Coefficient.—Because of the increas- 
ing importance of the phi coefficient, particularly in connection with test- 
item intercorrelations, it is desirable for the purposes of orientation to 
have some conception of the drastic limitations to the size of phi. In 
general, the maximal ¢ for any combination of p and p’ can be calculated 
by means of the formula 


$$\phi_{\max} = \sqrt{\left(\frac{p_j}{q_j}\right)\left(\frac{q_i}{p_i}\right)} \quad (\text{where } p_i \geq p_j) \tag{13.24}$$
(Maximal phi possible with different combinations of p_i and p_j)

where p_i = largest marginal proportion in a 2 × 2 contingency table.
      p_j = larger of the two marginal proportions in the other variable.
Wherever p_i = p_j the maximal φ equals 1.0. To apply this to Table
13.10, C and E,

$$
\begin{aligned}
\phi_{\max} &= \sqrt{\left(\frac{.50}{.50}\right)\left(\frac{.25}{.75}\right)} \\
            &= \sqrt{.3333} \\
            &= .58
\end{aligned}
$$

Computations with formula (13.24) are greatly facilitated by use of
Table G where values of √(p/q) and √(q/p) are given. Formula (13.24)
can be broken into the two components √(p_j/q_j) and √(q_i/p_i) whose product
gives the maximal phi.
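
Formula (13.24) is easily programmed when Table G is not available. The Python sketch below takes the larger marginal proportion of each variable and returns the maximal phi; the function name `max_phi` and the argument checks are our additions.

```python
import math

def max_phi(p_i, p_j):
    """Maximal phi for given larger marginal proportions (formula 13.24).

    p_i and p_j are the larger marginal proportions of the two variables,
    each at least .50 and below 1.0; the greater of the two is used as p_i.
    """
    for p in (p_i, p_j):
        if not 0.5 <= p < 1.0:
            raise ValueError("each proportion must be at least .50 and below 1.0")
    if p_i < p_j:
        p_i, p_j = p_j, p_i
    q_i, q_j = 1.0 - p_i, 1.0 - p_j
    return math.sqrt((p_j / q_j) * (q_i / p_i))

print(f"75-25 against 50-50: {max_phi(.75, .50):.2f}")
print(f"90-10 against 50-50: {max_phi(.90, .50):.2f}")
```

The two calls correspond to the 75-25 and 90-10 divisions discussed for Table 13.10.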

Figure 13.4 provides a graphic solution to the same equation for values
of p_i from .50 through .98 and p_j throughout the same range. These
ranges will take care of all practical situations in which φ would ordi-
narily be computed. It is recommended that the maximal φ that suits
any given situation be considered when interpreting an obtained φ as
representing a strength of the intrinsic relationship between two vari-
ables. The word intrinsic is stressed here, because the actual size of φ
indicates the degree of practical, predictive value of the relationship.
Predictive value is actually restricted by inequality of p_i and p_j.

The Coefficient of Contingency.—It has been shown how a φ coefficient
can be derived from chi square. Phi squared, for a 2 × 2 table, is equal
to chi square divided by N. For this reason φ² has been called the mean-
square contingency. By analogy, we might call φ the mean contingency,
although this name is not used for it. When there are more than two
classes in either X or Y, or in both, however, there is another correlation
index, called the coefficient of contingency, and it is designated by the let-
ter C. The formula for deriving it from chi square is

$$C = \sqrt{\frac{\chi^2}{N + \chi^2}} \tag{13.25}$$
(Coefficient of contingency computed from chi square)


Like φ, the coefficient of contingency is restricted in size, but not to
the same extent. When the number of categories is large (at least five
each way), C approaches the Pearson r in size. If the categorized data
represent continuous, normal distributions, if N is large, and if class
intervals are of approximately equal size, the correction procedures applied
to the Pearson r, described later in this chapter (Table 13.15), may be
applied to the C coefficient. If the data are in genuine categories (point
distributions, or nearly so), it is best to interpret C as it is. The maxi-
mum C for each given number of categories each way is shown in Table
13.11.

[Figure 13.4: graph. Horizontal axis, "Smaller of the two proportions
(p_j)," from 0.50 to 1.00; vertical axis, "Maximum phi coefficient possible."]

FIG. 13.4.—Maximum phi coefficients for different combinations of proportions of cases
in the categories in X and Y having the larger frequencies.


TABLE 13.11.—MAXIMAL VALUES ATTAINABLE FOR A COEFFICIENT OF CONTINGENCY
WITH DIFFERENT NUMBERS OF CATEGORIES IN BOTH X AND Y VARIABLES

Number of categories    2      3      4      5      6      7      8      9      10
Maximum C             .707   .816   .866   .894   .913   .926   .935   .943   .949


The standard error of C involves so much computation that it is hardly 
worth the effort to estimate it. A formula for this is given by Kelley.¹
For testing the hypothesis of zero correlation in a population, the chi 
square from which C is derived will serve very well. 
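
A short Python sketch may help in working with formula (13.25) and Table 13.11. The expression √((k − 1)/k) for the maximal C with k categories each way is not stated in the text; it is a standard result, offered here as an assumption, that reproduces the tabled values. The call using the chi square of 6.98 with N = 412 is for illustration only, since C is ordinarily reserved for tables larger than 2 × 2.

```python
import math

def contingency_coefficient(chi_square, n):
    """Coefficient of contingency from chi square (formula 13.25)."""
    return math.sqrt(chi_square / (n + chi_square))

def max_contingency(k):
    """Maximal C for a k x k table (assumed standard result, not from the text)."""
    return math.sqrt((k - 1) / k)

# Reproduce the row of maximal values in Table 13.11.
table_13_11 = {k: round(max_contingency(k), 3) for k in range(2, 11)}
print(table_13_11)

# Illustration only: the chi square of the marital-status problem.
c_value = contingency_coefficient(chi_square=6.98, n=412)
print(f"C = {c_value:.3f}")
```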


PARTIAL CORRELATION 


The Meaning of Partial Correlation.—A partial correlation between 
two things is one that nullifies the effects of a third variable (or a number 
of other variables) upon both the variables being correlated. The corre- 
lation between height and weight of boys in a group where age is per- 
mitted to vary would be higher than the correlation between height and 
weight for a group at constant age. The reason is obvious. Because 
boys are older, they are both heavier and taller. Age is a factor that 
enhances the strength of correspondence between height and weight. 
With age held constant, the correlation would still be positive and sig-
nificant, because at any age taller boys tend to be heavier.

If we wanted to know the correlation between height and weight with 
the influences of age ruled out, we could, of course, keep samples separated 
and compute r at each age level. But the partial-correlation technique 
enables us to accomplish the same result without so fractionating data 
into homogeneous groups. When only one variable is held constant, we 
speak of a first-order partial correlation. The general formula is 


$$r_{12.3} = \frac{r_{12} - r_{13}r_{23}}{\sqrt{(1 - r_{13}^2)(1 - r_{23}^2)}} \tag{13.26}$$
(First-order partial coefficient of correlation)

In a group of boys aged twelve to nineteen, the correlation between height
and weight (r_12) was found to be .78. Between height and age, r_13 = .52.
Between weight and age, r_23 = .54. The partial correlation is therefore

$$
\begin{aligned}
r_{12.3} &= \frac{.78 - (.52)(.54)}{\sqrt{(1 - .52^2)(1 - .54^2)}} \\
         &= \frac{.4992}{.7189} \\
         &= .69
\end{aligned}
$$


1 Kelley, T. L. Statistical method. New York: Macmillan, 1923. P. 269. 



With the influences of age upon both height and weight ruled out or 
nullified, then, the correlation between the two is .69. 

As another example with three variables, the correlation between
strength and height (r_41) in this same group was .58. The correlation
between strength and weight (r_42) was .72. Although there is a signifi-
cantly high correlation between strength and height, we wonder whether
this is not due to the factor of weight-going-with-height rather than to
height itself. So we hold weight constant and ask what the correlation
would be then. Will boys of the same weight show any dependence of
strength upon height? The correlation is given by

$$
\begin{aligned}
r_{41.2} &= \frac{.58 - (.72)(.78)}{\sqrt{(1 - .72^2)(1 - .78^2)}} \\
         &= \frac{.0184}{.4343} \\
         &= .042
\end{aligned}
$$

By partialing out weight, it is found that the correlation between height
and strength nearly vanishes. We conclude, therefore, that height as such
has no bearing upon strength, but only by virtue of its association with
weight does it show any correlation at all.
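
Both worked examples follow from one small function. The Python sketch below implements formula (13.26); the function and variable names are ours, and the correlations are those quoted in the text.

```python
import math

def partial_r(r_xy, r_xz, r_yz):
    """First-order partial correlation of x and y with z held constant (13.26)."""
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))

# Height and weight with age held constant.
height_weight = partial_r(r_xy=.78, r_xz=.52, r_yz=.54)

# Strength and height with weight held constant.
strength_height = partial_r(r_xy=.58, r_xz=.72, r_yz=.78)

print(f"height-weight, age constant:     {height_weight:.2f}")
print(f"strength-height, weight constant: {strength_height:.3f}")
```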

Second-order Partials.—When we hold two variables constant at the
same time, we call the coefficient a second-order partial r. The general
formula is

$$r_{12.34} = \frac{r_{12.3} - r_{14.3}r_{24.3}}{\sqrt{(1 - r_{14.3}^2)(1 - r_{24.3}^2)}} \tag{13.27}$$
(Second-order partial coefficient of correlation)

In using this formula, the subscripts will have to be modified to suit the
choice of variables. Here we are assuming that we want to know the
correlation that would occur between X_1 and X_2 with the effects of X_3
and X_4 eliminated from both. It is clear that this formula requires the
solution of three partials of the first order previously.

As an example of this partial, we may cite the correlation between 
strength and age with height and weight held constant. This would 
mean that if a group of boys having the same height and weight were 
taken, would older boys be stronger? The raw correlation between age 
and strength was .29. The second-order partial also turned out to be .29. 
This means that it seemingly makes no difference whether we allow height 
and weight to vary or whether we do not; the relation between age and 
strength is the same within the range examined. 
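
Since formula (13.27) is built from three first-order partials, it can be sketched in Python by calling the first-order formula three times. The text does not report all six raw correlations needed for its strength-and-age example, so the values of r_14, r_24, and r_34 below are hypothetical and serve only to exercise the function.

```python
import math

def partial_first(r_xy, r_xz, r_yz):
    """First-order partial correlation of x and y with z held constant (13.26)."""
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))

def partial_second(r12, r13, r14, r23, r24, r34):
    """Second-order partial r_12.34 from the six raw correlations (13.27)."""
    r12_3 = partial_first(r12, r13, r23)   # X1 and X2, X3 constant
    r14_3 = partial_first(r14, r13, r34)   # X1 and X4, X3 constant
    r24_3 = partial_first(r24, r23, r34)   # X2 and X4, X3 constant
    return (r12_3 - r14_3 * r24_3) / math.sqrt((1 - r14_3 ** 2) * (1 - r24_3 ** 2))

# r12, r13, r23 are the height, weight, and age correlations from the text;
# r14, r24, r34 are hypothetical values chosen only for illustration.
example = partial_second(r12=.78, r13=.52, r14=.30, r23=.54, r24=.40, r34=.20)
print(f"r_12.34 = {example:.3f}")
```

When the fourth variable is uncorrelated with the other three, the second-order partial reduces to the first-order partial, which offers a convenient check on the arithmetic.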

Some Suggestions Concerning Partial Correlation.—Needless to say,
unless the assumptions necessary for computing the Pearson r's involved
are fulfilled, there is little excuse for using them as the basis for com-
puting partial correlations. There are actually few occasions in psy-
chology and education when a partial r is called for. The partialing out
of such things as chronological age is perhaps the most common instance 
in which it is a useful device. It is not to be recommended as a lazy 
man’s substitute for experimental control and fractionation of data. 
The newer processes of analysis of variance and tests of significance of 
statistics from small samples make experimental planning seem more 
important and the treatment of results more satisfactory without resort 
to partial correlations. It is inadvisable, in any case, to carry the partial- 
correlation method much beyond the first-order stage. Beyond this, the 
structure of the relationships becomes very much involved, and one is 
bringing more and more raw r's into consideration, each with its own
fallibility. The building of an elaborate superstructure of statistics upon 
foundation stones that are not highly accurate in themselves often leads 
to questionable results. 

Reliability and Significance of an Obtained Partial r.—The standard
error of a partial coefficient of correlation is the same as for a Pearson r
except that the number of degrees of freedom should be used in the
denominator. The general formula is

$$\sigma_{r_{12.34 \ldots m}} = \frac{1 - r_{12.34 \ldots m}^2}{\sqrt{N - m}} \tag{13.28}$$
(Standard error of a partial r)

where m is the number of variables involved.
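
A brief Python sketch of formula (13.28) follows. The size of the group of boys is not given in this section, so N = 100 is a hypothetical figure used only to show the computation for the partial r of .69 among three variables.

```python
import math

def se_partial_r(r, n, m):
    """Standard error of a partial r among m variables in a sample of n (13.28)."""
    if n <= m:
        raise ValueError("n must exceed the number of variables m")
    return (1 - r ** 2) / math.sqrt(n - m)

# r = .69 from the height-weight-age example; N = 100 is hypothetical.
se = se_partial_r(r=.69, n=100, m=3)
print(f"standard error = {se:.3f}")
```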


SOME SPECIAL PROBLEMS IN CORRELATION 


The Relativity of All Coefficients of Correlation.—It is apparent that 
the size of the coefficient of correlation depends to some extent upon the 
method of computing it. What is more important, coefficients computed 
between the same two variables by the same procedure will vary not only 
from sample to sample but from population to population. If there are 
any really absolute correlations in the universe, all variables except the 
two being held constant, those correlations are probably either zero or 1, 
or close to either of those values. With contaminating variables left in, 
the correlations are usually between zero and 1. It is therefore really
meaningless to speak of the correlation between intelligence and character
(if it is assumed even that we know what those variables are and have
properly measured them) or even between age and height or any other
common variables without at the same time specifying what kind of sam-
ple we measured.

A coefficient is always relative to the kind of population sampled and to
the manner in which the measurements were made. In reporting coefficients
of correlation, any writer should be very careful to state all the pertinent
factors that bear upon the size of his obtained correlation coefficients, and
any reader should accept interpretations only when the significant circum-
stances are kept in mind. A few of the more common sources of vari-
ations of size of r will be reviewed briefly in what follows.

The Variability in the Correlated Variables.—The size of r is very much 
dependent upon the range of ability or, in more general terms, the vari- 
ability of measured values, in the correlated sample. The greater the 
variability, the higher will be the correlation, everything else being equal.
It should be easier to predict individual differences in scholarship in a
class with IQ's ranging from 50 to 150 than in a class where the range is
restricted to 90 to 110. If the restriction were to a range of zero (all IQ's
being equal) there should be no correlation whatever—the limiting case, 
in which, of course, no r could be computed at all. Often we know the 
correlation between some predictive index, such as aptitude-test score and 
scholarship or some vocational criterion of success as derived from one 
group of individuals, but we shall be applying the same index to other 
groups with different ranges of ability, larger or smaller. What will be 
the effectiveness of predictions in the new groups? 

In the selection of personnel by means of tests, as during World War II, 
research on selective instruments was constantly beset with this very 
practical problem. New tests were put into use in the selection of 
personnel, and they correlated substantially with tests that were used in 
selection. The result was that the men who went into training repre- 
sented only a higher segment of the population from which selection was 
to be made by the new tests. The validity of a test could be estimated 
only for this higher segment of restricted range. And yet, it was the 
validity in the total population that it was important to know, for it is 
that validity which indicates the full selective value of the test. The
coefficient of validity in the restricted group is almost invariably smaller 
than what it would be in an unrestricted group. 

In a research program such as that on the selection and classification of 
aviation trainees during World War II, the problem of restriction of range 
became quite important. Near the end of the war, about 50 per cent of 
the applicants for aircrew training failed to pass the general qualifying 
examination, and of these as many as 75 per cent failed to qualify for a 
particular type of training. Furthermore, it was desired to correlate 
tests with advanced-training achievement criteria and even combat per- 
formance after many more had been eliminated at various stages of 
training. The proportions of the original applicants who survived to 
these stages were rather small. Restriction of range was very great. 

Karl Pearson, many years ago, provided a solution to this problem that
applies under certain conditions. The variables being studied must be
normally distributed in the population and we must know certain param-
eters or estimates of them in order to solve the problem in any particu-
lar situation. We need to know the relation of the dispersions in the
restricted and unrestricted populations, either in terms of the variable on 
which selection occurred or on the basis of some variable correlated with it. 
We also need to know the correlation in the restricted population between 
the variable we wish to validate and the criterion of success in training 
or on the job. There are three formulas of practical use in this problem, 
each of which recognizes the availability of certain information and the 
need for validation of a certain kind of variable. 

Case I.—Restriction is produced by selection on the basis of X_1 and
there is knowledge of standard deviations in X_1 for both restricted and
unrestricted groups. The correlation r_12 is known in the restricted group.
The correlation R_12 for the unrestricted group is estimated by

$$R_{12} = \frac{r_{12}\left(\dfrac{\Sigma_1}{\sigma_1}\right)}{\sqrt{1 - r_{12}^2 + r_{12}^2\left(\dfrac{\Sigma_1}{\sigma_1}\right)^2}} \tag{13.29}$$
(Correlation corrected for restriction of range, Case I)

where r_12 = correlation between X_1 and X_2 in the restricted group.
      σ_1 = standard deviation in measurements on X_1 in the restricted
            group.
      Σ_1 = standard deviation in the same variable in the unrestricted
            group.
In this and in the next two formulas, capital letters stand for values per-
taining to the unrestricted population and lower-case letters refer to the
restricted population.

The application of this formula is as follows. Suppose that the
selection test (X_1) correlated .30 with the training criterion in the
group selected on the basis of the test. The standard deviation in
the unrestricted group (Σ_1) was 20 and that in the restricted group (σ_1)
was 10. The solution is

$$
\begin{aligned}
R_{12} &= \frac{.30\left(\dfrac{20}{10}\right)}{\sqrt{1 - .09 + (.09)\left(\dfrac{20^2}{10^2}\right)}} \\
       &= \frac{.60}{\sqrt{1 - .09 + .36}} \\
       &= .53
\end{aligned}
$$


Case II.—Restriction is produced by selection on the basis of X_1 and
there is knowledge of standard deviations for X_2 in both restricted and
unrestricted samples and of the correlation r_12 in the restricted group.
The correlation in the unrestricted group is estimated by

$$R_{12} = \sqrt{1 - \frac{\sigma_2^2}{\Sigma_2^2}\left(1 - r_{12}^2\right)} \tag{13.30}$$
(Correlation corrected for restriction of range, Case II)

where σ_2 = the standard deviation on X_2 in the restricted group.
      Σ_2 = the standard deviation on X_2 in the unrestricted group.

This formula would apply when we correlate two selection tests, when we
have selected on the basis of one test (X_1) but know the change of range
from knowledge of variances in the other test (X_2). One or both of the
"tests" might be a composite score derived from a combination of several
tests. An example of this from aviation psychology was the correlation of
an experimental test with the pilot stanine (composite aptitude score)
when selection had been made on the basis of the stanine and it was more
convenient to use the change in dispersion on the test. If we assume the
same restricted correlation (r_12 = .30) as in the previous illustration, also
that the restricted and unrestricted standard deviations are 10 and 20,
respectively,

$$
\begin{aligned}
R_{12} &= \sqrt{1 - \frac{10^2}{20^2}\left(1 - .30^2\right)} \\
       &= \sqrt{1 - (.25)(.91)} \\
       &= .88
\end{aligned}
$$

Case III.—Restriction is produced by selection on variable X_3, on
which variable the restricted and unrestricted standard deviations are
known. We wish to estimate the unrestricted correlation R_12, when we
also know r_12, r_13, and r_23. The formula is

$$R_{12} = \frac{r_{12} + r_{13}r_{23}\left(\dfrac{\Sigma_3^2}{\sigma_3^2} - 1\right)}{\sqrt{\left[1 + r_{13}^2\left(\dfrac{\Sigma_3^2}{\sigma_3^2} - 1\right)\right]\left[1 + r_{23}^2\left(\dfrac{\Sigma_3^2}{\sigma_3^2} - 1\right)\right]}} \tag{13.31}$$
(Correlation corrected for restriction of range, Case III)

where the symbols are defined similarly to those in formulas (13.29) and
(13.30). This formula would apply to the correlation of a new, experi-
mental test X_1 with a practical criterion X_2, when selection had been
made on the basis of a third variable (pilot stanine, for example) X_3.
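
No numerical illustration of Case III is given, so the Python sketch below uses hypothetical correlations (r_12 = .20, r_13 = .40, r_23 = .50) with the same standard deviations of 10 and 20. The function name is ours. One useful check on the formula is that when the selection variable X_3 is the test X_1 itself (r_13 = 1 and r_23 = r_12), the result agrees with Case I.

```python
import math

def correct_case_3(r12, r13, r23, sd3_restricted, sd3_unrestricted):
    """Correlation corrected for restriction of range, Case III (formula 13.31).

    Selection was made on a third variable X3, whose restricted and
    unrestricted standard deviations are known.
    """
    g = (sd3_unrestricted / sd3_restricted) ** 2 - 1
    numerator = r12 + r13 * r23 * g
    denominator = math.sqrt((1 + r13 ** 2 * g) * (1 + r23 ** 2 * g))
    return numerator / denominator

# Hypothetical correlations, chosen only to illustrate the computation.
R12_case_3 = correct_case_3(r12=.20, r13=.40, r23=.50,
                            sd3_restricted=10, sd3_unrestricted=20)
print(f"R_12 = {R12_case_3:.3f}")

# Check: selection on X1 itself reproduces the Case I result for r12 = .30.
same_as_case_1 = correct_case_3(r12=.30, r13=1.0, r23=.30,
                                sd3_restricted=10, sd3_unrestricted=20)
print(f"Case I check: {same_as_case_1:.3f}")
```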
The reader may have been somewhat surprised at the rather radical
change in correlation that occurred as we corrected for restriction of
range in the two hypothetical problems above. To show that these
changes are not unreasonable, some data will be cited from the AAF
results.¹ An experimental group of more than a thousand pilots had
been permitted to enter training without any selection whatever on the
basis of either qualifying or classification tests. We know, then, how the
pilot stanine and certain classification tests correlated with the graduation-
elimination criterion at the end of training. We can also arbitrarily pull
out a high segment of the total sample and within that limited sample
compute validity coefficients. The results are given in Table 13.12 for
the instance in which a rather high, but not unknown, selection of the
top 13 per cent occurred. It can be seen that where there were sub-
stantial correlations in the unrestricted sample the correlations in the
selected group often shrank close to zero, and in one instance, to a trivial
negative r. On the whole, those tests that correlated highest with the
stanine lost most in validity correlation because of selection on the basis
of the stanine.

TABLE 13.12.—VALIDITY COEFFICIENTS FOR SELECTIVE TESTS AND A COMPOSITE SCORE
FOR THE SELECTION OF PILOT STUDENTS WITH AND WITHOUT RESTRICTION OF
RANGE

                               Correlation        Correlation in
                               in the total       the selected highest
Variable                       group              13 per cent
                               (N = 1036)         (N = 136)
Pilot stanine                  .64                .18
Mechanical principles          .44                .03
General information            .46                .20
Complex coordination           .40                [illegible]
Instrument comprehension       .45                .27
Arithmetic reasoning           [illegible]        .18
Finger dexterity               .18                .00

Evaluation of the Correction Formulas for Restriction.—It should be
repeated that the problem of restriction is important, and that if one
wishes to avoid wrong conclusions, when a substantial amount of selec-
tion has been made, one should apply correction procedures. Had we
taken the second (restricted) set of coefficients in Table 13.12 seriously,
without other knowledge to the contrary, we would probably have con-
cluded that formerly valid tests, and even the stanine, had lost their
former validities that were known early in the war when selection was a
cause of little restriction.

1 Thorndike, R. L. (Ed.). Research problems and techniques. AAF aviation
psychology research program, Report No. 3. Washington, D.C.: Government Printing
Office, 1947.

It should be remembered that the formulas rest on the assumption of 
normal distributions of the population on the variables used, and the 
Pearson product-moment r is presupposed. The use of the biserial r or 
tetrachoric r as an estimate of it raises considerable question when selec- 
tion is severe. Experience tends to show, however, that when the 
biserial r is used as the validation coefficient, the formulas tend to
underestimate the unrestricted correlation. Better formulas are probably
being developed and time will tell whether they will replace those based
upon Pearson’s solution.1 The standard errors for these corrected
coefficients are unknown, but it is probable that they are much larger
than those for Pearson r’s of comparable size.

Correlations in Heterogeneous Samples.—Studies of validity of tests 
and examinations have frequently been faulty from a number of stand- 
points. The use of school marks as criteria of success in training is in 
itself a questionable procedure, school marks being derived as they gen- 
erally are on the basis of measurements of questionable reliability and 
validity and contaminated with irrelevant factors. This situation alone 
stacks the cards against high validity coefficients for predictive indices 
at the start. 

There is another factor working against fair tests of validity that we 
shall face particularly here, a factor also dependent upon the unwarranted 
faith in school marks as absolute and dependable measures of scholarship. 
This factor is the indiscriminate pooling of marks from different subjects 
and from different instructors and treating them as if they were of the 
same kind of coin. Any cursory inspection of grade distributions in a 
single institution of learning will show that marks are not by any means 
of constant value when obtained from different sources. The reader is 
referred to the situation in Fig. 14.2 where students in an English course 
making the same score in a common achievement examination were 
assigned different marks in different sections and by different instructors, 
probably within the same section. If it is assumed that the comprehen- 
sive examination was a valid measure of the students’ relative degree of 
mastery of the objectives of the course, it can be seen how much other
factors must have entered into the determination of the final mark in the
course.

1 Thorndike, ibid.

Reference to Fig. 14.2 will show that there is quite a range of scores, 
from about 85 to 125, within which students were assigned marks all 
the way from F to B. Only as between marks of F and A is there rather 
complete lack of overlapping. Striking as this situation is, it is probably 
rather representative of how much lack of correlation there is between 
school marks and genuine achievement. Much of this is due to the 
fluctuation of marking ideas and ideals from instructor to instructor. 
This variation from set to set of marks when they are collectively corre- 
lated with other measures is bound to alter the apparent amount of 
correlation. 

As an example, in six sections of freshman English, within sections the
correlation between quiz averages for the semester and a final
comprehensive examination ranged from .63 to .92, with an over-all
correlation within sections, when intersection differences had been
eliminated, of .83. Yet when the six sections were combined, with
intersectional differences left in, the correlation was reduced to .71. It
was interesting to find that between sections the correlation was -.17,
which means that there was a very slight tendency for sections with average
lower achievement to be given a higher average quiz mark! This fact
accounts for the reduction in correlation from .83 to .71 when sections
were combined.1
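The effect of intersection differences on a combined correlation can be
illustrated with a minimal sketch. The scores below are hypothetical, not
the data of the six English sections, and the function names are introduced
only for this illustration. The sketch computes the correlation within
sections (deviations taken from each section's own means), between sections
(correlating section means), and in the combined group.

```python
import math


def pearson(x, y):
    """Pearson product-moment r for two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical (quiz average, examination score) pairs; NOT the author's data.
# Sections with higher quiz averages have lower examination means.
sections = [
    [(70, 58), (74, 63), (78, 64), (82, 70), (86, 71)],
    [(66, 63), (70, 67), (74, 70), (78, 73), (82, 78)],
    [(62, 66), (66, 69), (70, 74), (74, 76), (78, 81)],
]


def within_between_total(groups):
    """Return (within, between, total) correlations for grouped pairs."""
    dev_x, dev_y, mean_x, mean_y, all_x, all_y = [], [], [], [], [], []
    for pairs in groups:
        xs = [p[0] for p in pairs]
        ys = [p[1] for p in pairs]
        mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
        # Deviations from the section's own means remove intersection differences.
        dev_x += [x - mx for x in xs]
        dev_y += [y - my for y in ys]
        mean_x.append(mx)
        mean_y.append(my)
        all_x += xs
        all_y += ys
    # Between-sections r uses unweighted means (sections here are equal in size).
    return pearson(dev_x, dev_y), pearson(mean_x, mean_y), pearson(all_x, all_y)


r_within, r_between, r_total = within_between_total(sections)
print(f"within = {r_within:.2f}, between = {r_between:.2f}, total = {r_total:.2f}")
```

In these made-up data the between-sections correlation is negative, and the
correlation in the combined group falls well below the within-sections
value, which is the pattern described in the example above in exaggerated
form.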

Figure 13.5 pictures the kind of situation just described, in somewhat 
exaggerated form, in Diagram II. Diagram II is best understood by 
contrasting it with Diagram I. In the latter we have a homogeneous 
combination of four subsamples drawn from the same population. The 
correlation between X and Y within each subsample is indicated by a 
smaller ellipse. All the ellipses are of about the same shape, indicating 
about the same degree of correlation of X and Y. The x marks indicate 
the means of Y and X within each subsample. If we combine the four 
samples, we obtain a distribution described approximately by the large 
dotted ellipse. Note that the proportions of the large ellipse are about 
the same as for each small ellipse, indicating the same level of correlation 
within the composite distribution as within each subsample. Note, also, 
that the distribution of the four means forms roughly an ellipse of similar 
proportions. If the correlation between means of Y and means of X 
differs from that within subsamples, the correlation of X and Y in the 
composite sample will differ from that within subsamples. 

1 Further discussion of “within” versus “between” correlations when groups
are combined will be found in E. F. Lindquist’s Statistical analysis in
educational research, p. 219.

In Diagram II of Fig. 13.5 we have a very different situation. While
within each subsample the correlation between K and L is the same, the
subsamples did not arise from the same population so far as means are
concerned. An ellipse drawn to inclose the x’s would slant in the
direction to assure a negative correlation between means of K and means of
L. The effect of this can be seen in the dotted line enclosing all
subsamples. Its form suggests approximately zero correlation. Such
situations are not uncommon. In general practice, if it is doubtful whether
subsamples arose by random sampling from the same population, it would be
best to compute correlations within subsamples separately or to apply
equivalent procedures which we will not take the space to describe here.
The hypothesis of homogeneity of samples can be tested by means of t tests
or F tests as described in Chs. 9 and 10.

[Figure 13.5: three scatter diagrams. Diagram I, scores in Y plotted
against scores in X; Diagram II, scores in L against scores in K;
Diagram III, scores in T against scores in S.]
Fig. 13.5.—Illustration of correlations in homogeneous and heterogeneous
groups of subsamples.

1 See Lindquist, op. cit.

The Correlation of Averages.—It was stated in an earlier chapter in
connection with tests of significance of differences between statistics
(Ch. 9) that the correlation between averages of samples is equal to the
correlation between individual pairs of measurements. This statement
assumes random samples from a homogeneous population. Diagram I in
Fig. 13.5 illustrates this kind of situation and shows how an r obtained
within one sample can be used as an estimate of a correlation between
means. Diagram II shows how a correlation coefficient obtained within
a single sample might be very misleading as to the amount of correlation
between means. This shows an instance in which the correlation between
means is decidedly lower than that within samples, if not reversed in sign.

The correlation between means could also be higher than that within 
samples, as Diagram III shows. An example of this would be the correlation
between IQ and salary. Correlating individuals, we should find some
positive correlation, but because of great variations in salary at any
single IQ value, the correlation might not be very high. If we divided men
into sets according to vocation and correlated average IQ with average
salary, the coefficient would probably be very high. This is because people
of different IQ levels gravitate to certain occupations, and occupations as
such have established characteristic salary scales. Other factors that make
for individual differences in salary within occupations are thus minimized
in importance. The sampling is biased the moment we divide groups along
occupational lines.

Averaging Coefficients of Correlation.—One solution to the problem of
correlations in some heterogeneous samples is to estimate the correlation 
between X and Y within each subsample and then average the coefficients 
in order to obtain a single estimate of the population correlation. This 
would presumably describe the relation between X and Y throughout the 
composite sample, free from whatever sampling biases there may have 
been in segregating the subsamples. Before averaging coefficients,
however, we must make the assumption that the several r’s did arise by
random sampling from the same population—same with respect to the 
degree of correlation. It should go without saying, also, that we have 
correlated the same variables in all samples. The test of homogeneity 
of the r’s themselves would be based upon their standard errors. 

There are several procedures sometimes used in averaging r’s.
Coefficients of correlation are not values on a scale of equal metric
units; they are index numbers. Differences between large r’s are actually
much greater than those between small r’s. If the few sample r’s to be
averaged, however, are of about the same value and if they are not too
large, a simple arithmetic mean will suffice. If the r’s differ
considerably in size and if they are too large (above .80) some writers
urge the procedure using Fisher’s z coefficients. This is illustrated in
Table 13.13. It consists of transforming each r into a corresponding z
(Table H may be used for this purpose), finding the arithmetic mean of the
z’s, and finally, converting the mean z back to the corresponding r.

TABLE 13.13.—DEMONSTRATION OF AVERAGING COEFFICIENTS OF CORRELATION WHEN
r’s DIFFER IN RANGE AND IN SIZE

            Sample A             Sample B             Sample C             Sample D
       Mean of r  z method  Mean of r  z method  Mean of r  z method  Mean of r  z method
          .45       .48        .75       .97        .35       .37        .65       .78
          .50       .55        .80      1.10        .55       .62        .85      1.26
          .42       .45        .72       .91        .68       .83        .98      2.30
          .38       .40        .68       .83        .50       .55        .80      1.10
          .55       .62        .85      1.26        .58       .66        .88      1.38
Σ        2.30      2.50       3.80      5.07       2.66      3.03       4.16      6.82
M_z                 .50                 1.014                 .606                1.364
M_r       .46       .46        .76       .77        .532      .543       .832      .877

The results of Table 13.13 show differences to be expected in the use of
an arithmetic mean of r’s and of corresponding z’s. Samples A and B
have the same range of r’s, those in B being merely .30 greater than those
in A. In sample A, agreement is perfect in the results from the two
methods. In sample B, the mean r by the z method is .01 higher (.77 as
compared with .76). In samples C and D there is much more spread in
the r’s averaged. For the r’s of moderate size, sample C, the z method
gives a result only .01 greater than the simple mean of r’s. In the high
coefficients, however, the difference is about .05. There is serious
question whether r’s differing as much as these would satisfy the belief
that they came from the same population by random sampling, and hence they
would probably not be candidates for averaging. When a few r’s do
satisfy this belief, the chances are that any discrepancy between a simple 
mean of r’s and an average obtained by the z method would be so small 
as compared with the standard error of r that we could readily forgo 
the use of the extra effort of the z method. If the r’s did come from the 
same population, a mean of several would be a much more reliable esti- 
mate of population correlation. With the requirements satisfied, we 
could add degrees of freedom from the different subsamples to represent 
the degrees of freedom of the mean r and interpret its reliability and sig- 
nificance accordingly. 
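The two averaging procedures can be compared in a short sketch. It takes
the r’s of Table 13.13 as best they can be read, uses the hyperbolic
arctangent as Fisher's r-to-z transformation in place of a lookup in
Table H, and introduces the function names only for illustration. Because
the z’s are computed rather than read from a two-place table, the last
digit may differ slightly from the tabled results.

```python
import math


def mean_r_simple(rs):
    """Simple arithmetic mean of the coefficients."""
    return sum(rs) / len(rs)


def mean_r_fisher(rs):
    """Transform each r to Fisher's z, average the z's, transform back to r."""
    zs = [math.atanh(r) for r in rs]   # z = 0.5 * ln((1 + r) / (1 - r))
    return math.tanh(sum(zs) / len(zs))


# Coefficients as read from Table 13.13.
samples = {
    "A": [.45, .50, .42, .38, .55],
    "B": [.75, .80, .72, .68, .85],
    "C": [.35, .55, .68, .50, .58],
    "D": [.65, .85, .98, .80, .88],
}

for name, rs in samples.items():
    print(name, round(mean_r_simple(rs), 3), round(mean_r_fisher(rs), 3))
```

The gap between the two means is negligible for samples A to C and becomes
appreciable only for the widely spread, high coefficients of sample D.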

Weighting Coefficients in Averaging.—One more requirement should be
mentioned, particularly if the last operation, combining degrees of
freedom, is to be carried out. That is to weight the obtained r’s in
averaging them. The weight for each sample is its number of degrees of
freedom (N - 2). In using the z method, the weights are applied to the z’s.
A discussion of weighted averages was given in Ch. 4.
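A weighted average of this kind might be sketched as follows. The sample
sizes are hypothetical, the function name is introduced only for
illustration, and the weights are the degrees of freedom (N - 2) named
above, applied either to the r’s or to their z transformations.

```python
import math


def weighted_mean_r(rs, ns, use_z=False):
    """Average coefficients with weights N - 2; optionally via Fisher's z."""
    weights = [n - 2 for n in ns]
    values = [math.atanh(r) for r in rs] if use_z else list(rs)
    mean = sum(w * v for w, v in zip(weights, values)) / sum(weights)
    return math.tanh(mean) if use_z else mean


# Hypothetical coefficients and sample sizes for three subsamples.
rs = [.45, .50, .42]
ns = [30, 50, 100]

print(round(weighted_mean_r(rs, ns), 3))              # weights applied to r
print(round(weighted_mean_r(rs, ns, use_z=True), 3))  # weights applied to z
print(sum(n - 2 for n in ns))                         # combined degrees of freedom
```

The largest subsample dominates the result, so the weighted mean lies
nearer its coefficient than a simple mean would.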

The Correlation of Parts with Wholes.—We frequently want to corre- 
late a part measurement, such as a part of a test battery, or a test item, 
with the whole of which it is a part. Since the variance of the total is in 
part made up of the variance of the component, that fact alone intro- 
duces some degree of positive correlation. The greater the relative con- 
tribution to the total variance by the component, the more important is 
this “spurious” factor. It is possible in a particular instance that the 
part is totally uncorrelated with the remaining parts and yet will be corre- 
lated with the total. If it is negatively correlated with the remaining 
parts, it will be less negatively correlated with the total. 

If each part contributes statistically about the same amount of variance 
to the total or if the part is one of a great many, so that its proportion of 
contribution is relatively small, we can compare correlations between parts 
and total with some confidence that they are compared on a very similar 
basis. But if these conditions do not obtain, we should do better to 
correlate each part with a composite of all other parts. When such a 
composite is unknown or is hard to obtain, we can still estimate the
correlation by means of the formula

$$r_{pq} = \frac{r_{pt}\sigma_t - \sigma_p}{\sqrt{\sigma_t^2 + \sigma_p^2 - 2 r_{pt}\sigma_t\sigma_p}}$$
(Correlation of part with a remainder, knowing the correlation of part
with total) (13.32)

where p = part score.
      t = total score.
      q = t - p, in other words, the total with the part excluded.
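As a check on formula (13.32), the sketch below compares the estimate with
the correlation computed directly between a part and its remainder. The
scores are hypothetical and the function names are introduced only for
illustration; standard deviations are computed with N in the denominator.

```python
import math


def sd(v):
    """Standard deviation with N in the denominator."""
    m = sum(v) / len(v)
    return math.sqrt(sum((a - m) ** 2 for a in v) / len(v))


def pearson(x, y):
    """Pearson product-moment r for two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
    return cov / (sd(x) * sd(y))


def part_remainder_r(r_pt, sd_p, sd_t):
    """Formula (13.32): r of a part with the remainder of the total."""
    return (r_pt * sd_t - sd_p) / math.sqrt(
        sd_t ** 2 + sd_p ** 2 - 2 * r_pt * sd_t * sd_p
    )


# Hypothetical part scores p and remainder scores q; t is their sum.
p = [3, 5, 2, 6, 4, 7, 5, 3]
q = [20, 24, 19, 27, 25, 26, 21, 22]
t = [a + b for a, b in zip(p, q)]

r_pt = pearson(p, t)
r_direct = pearson(p, q)
r_formula = part_remainder_r(r_pt, sd(p), sd(t))
print(round(r_pt, 3), round(r_direct, 3), round(r_formula, 3))
```

The formula reproduces the directly computed part-remainder correlation,
and the part-total correlation is the larger of the two, as the argument
about the "spurious" factor leads one to expect.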


In the correlation of test items each with the total score of the test of 
which they are a part, particularly, it is important to know about how 
much a part would correlate with the total when there is really no relation- 
ship at all. We can estimate this, but only under the condition that each 
part has the same variance and there is zero intercorrelation among all 
parts. Under these special conditions the average amount of correlation
of a part with the total is given by the equation

$$\bar{r}_{pt} = \frac{1}{\sqrt{n}}$$
(Average correlation of a number of parts, of equal variance and zero
intercorrelation, with their total) (13.33)

in which n = the number of parts.

In a test composed of 10 such items, the average r_pt would equal .316;
with 20 items the corresponding figure would be .224; with 30, .183; and
with 40, .158. These values should serve as guides. Any obtained
part-whole coefficients of these sizes even though statistically
significant might reflect an actual zero correlation of the part with the
remaining parts. Application of the correction formula (13.32) would be a
check upon these low correlations.
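The guide values follow from a short argument: with n uncorrelated parts of
equal variance, the covariance of one part with the total is just that
part's variance, while the standard deviation of the total is the part's
standard deviation times the square root of n, so the ratio reduces to
1/sqrt(n). A minimal sketch (the function name is introduced only for
illustration) reproduces the figures quoted above.

```python
import math


def chance_part_whole_r(n_parts):
    """Formula (13.33): part-total r for n uncorrelated, equal-variance parts."""
    return 1 / math.sqrt(n_parts)


for n in (10, 20, 30, 40):
    print(n, round(chance_part_whole_r(n), 3))
```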

If we should want to know the correlation of a part with a whole of
which it is a part and we already know the correlation of the part with
the remainder of the whole, the estimate is made by the equation

$$r_{pt} = \frac{\sigma_p + r_{pq}\sigma_q}{\sqrt{\sigma_p^2 + \sigma_q^2 + 2 r_{pq}\sigma_p\sigma_q}}$$
(Correlation of part with whole knowing the correlation between part and
the remainder) (13.34)

in which the symbols have the same meaning as in formula (13.32). The
utility of this formula is probably rather limited. It is given primarily
to show what happens when two parts that correlate zero are combined.
If r_pq is .0 in formula (13.34), the numerator reduces to σ_p. The
denominator is actually the standard deviation of the composite (p + q).
The deduction is that if two parts correlate zero, when combined the
correlation of the part with the total will be equal to the ratio of the
standard deviation of the part to that of the total.
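The zero-correlation case can be checked numerically. The sketch below
assumes illustrative standard deviations and a function name introduced
only for this purpose; it evaluates formula (13.34) and confirms that with
r_pq = 0 the result is the ratio of the part's standard deviation to that
of the total.

```python
import math


def part_whole_r(r_pq, sd_p, sd_q):
    """Formula (13.34): r of a part with the whole, from r of part with remainder."""
    sd_t = math.sqrt(sd_p ** 2 + sd_q ** 2 + 2 * r_pq * sd_p * sd_q)
    return (sd_p + r_pq * sd_q) / sd_t


# Illustrative values: part SD 3, remainder SD 4, zero correlation.
# The composite then has SD 5, so the part-whole r is 3/5.
print(part_whole_r(0.0, 3.0, 4.0))

# One unit-variance part with a remainder of nine such uncorrelated parts
# (remainder SD = 3) gives 1/sqrt(10), in agreement with formula (13.33).
print(round(part_whole_r(0.0, 1.0, 3.0), 3))
```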

Index Correlation.—This is usually called spurious index correlation for
the reason that when indices such as IQ, EQ, or AQ are correlated with
each other, r is markedly influenced by the fact that these ratios have in
common such factors as chronological age and mental age. IQ’s from
two different tests are derived from the MA’s obtained from the two tests
each divided by the same CA. If there is a range of CA in the group
correlated, this fact in itself introduces some positive correlation.

Table 13.14 will show by means of a purely fictitious and overdrawn
picture how this phenomenon works. For eight children who differ in
chronological age from five to nine inclusive, mental-age ratings on two
different tests are given. These are obviously selected children, since
their mental-age values hover at seven and eight in a haphazard manner.
Note, however, how the IQ’s spread, from 140 through 78. The spread
in IQ’s is almost entirely due to the spread in chronological ages. Since
each child has the same chronological age for both IQ’s, that same
denominator of the ratio of his MA to CA assures that his IQ’s will be
about the same. Some IQ’s go up together in the two tests for children of
low CA and others go down together, for children with higher CA. The
correlation computed between IQ’s is .92. The same sort of phenomenon goes
on in the actual situation to a lesser extent when there is an appreciable
range of chronological age.

TABLE 13.14.—DEMONSTRATION OF HOW INDEX NUMBERS MAY ACQUIRE A HIGH
DEGREE OF CORRELATION BECAUSE OF A COMMON DENOMINATOR: AN EXTREME CASE

Child   Chronological   Mental   Mental    IQ     IQ
        age             age I    age II    I      II

A       5.0             7        8         140    160
B       5.5             8        8         145    145
C       6.0             7        7         117    117
D       6.5             8        7         123    108
E       7.5             8        8         106    106
F       8.0             7        8          88    100
G       8.5             8        7          94     82
H       9.0             7        7          78     78

Correlation between mental ages I and II = .00
Correlation between IQ’s I and II = .92
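The figures of Table 13.14 can be recomputed in a few lines. The sketch
below takes the table values as best they can be read (the chronological
age of child E is assumed to be 7.5, which fits the half-year spacing of
the ages and the IQ of about 106), forms each IQ as 100 × MA / CA without
rounding, and correlates the mental ages and the IQ’s. The function name is
introduced only for illustration.

```python
import math


def pearson(x, y):
    """Pearson product-moment r for two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Values as read from Table 13.14; CA of child E (7.5) is an assumption.
ca = [5.0, 5.5, 6.0, 6.5, 7.5, 8.0, 8.5, 9.0]
ma_1 = [7, 8, 7, 8, 8, 7, 8, 7]
ma_2 = [8, 8, 7, 7, 8, 8, 7, 7]

# Both IQ's of a child share the same denominator, his chronological age.
iq_1 = [100 * m / c for m, c in zip(ma_1, ca)]
iq_2 = [100 * m / c for m, c in zip(ma_2, ca)]

r_ma = pearson(ma_1, ma_2)
r_iq = pearson(iq_1, iq_2)
print(f"mental ages: r = {r_ma:.2f}; IQ's: r = {r_iq:.2f}")
```

Mental ages that are uncorrelated thus yield IQ’s that correlate near .92,
entirely through the common divisor.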

In the author’s opinion, the term spurious is not to be confined to this
type of situation in particular; for in a sense, all correlations are
spurious to the extent that they are influenced by the conditions under
which they were obtained. If one remembers what IQ’s are and interprets
correlations between them accordingly, no particular falsification of the
facts is in question. The important thing is that one should correlate
variables in the full knowledge of how the measurements were obtained, if
possible, and should report to his readers the facts needed for wise
interpretation, whether it be variability of the correlated group or range
of CA’s involved when IQ’s have been correlated.

The real difficulty comes when investigator or reader takes IQ’s to be
some real, absolute properties of individuals, on the one hand, and when
someone not oblivious to the common CA factor plays it up as a fatal
source of “error,” on the other hand. Both should remember the relative
nature of all correlation coefficients. The important thing is that
the wary investigator should not attribute his results to some supposed
real nature of psychological or educational phenomena when some property
of statistical treatment is really responsible. Nor will the sophisticated
critic fail to grant the utility of certain procedures shown to be fruitful
under the circumstances of operation even when some “spurious” element
has entered the picture. Errors, too, are relative matters. What is an
error from the point of view of one frame of reference may be the truth
when the frame of reference is changed.

Correction in r for Errors of Grouping.—If in computing a Pearson r
by means of grouping data in class intervals, a small number of classes
either way has been used, the estimate of correlation is lowered to some
degree. In the limiting case, of two classes each way, the computed r is
less than two-thirds of the r had there been no grouping. When the
number of intervals is 10 both ways, r is about 3 per cent underestimated.
For any number of classes in X or in Y, we can correct for the error of 
grouping by dividing r by a constant corresponding to that number of 
classes. 

The correction is necessary because errors of grouping yield over- 
estimates of the standard deviations, as was pointed out in Ch. 5. If 
Sheppard’s correction has been applied to both standard deviations, no 
further correction is necessary in the coefficient of correlation. 

Table 13.15 supplies the list of constants given by Peters and Van
Voorhis to be used in making corrections in r.1 Correction is made for
the number of categories or intervals in Y as well as in X.

TABLE 13.15.—CORRECTION FACTORS FOR ERRORS OF GROUPING IN THE COMPUTATION
OF PEARSON’S r WHEN DISTRIBUTIONS ARE NORMAL AND MIDPOINTS OF
INTERVALS STAND FOR CASES IN THE INTERVALS

Number of      Correction     Squared correction
intervals      factor         factor

 2             .816           .667
 3             .859           .737
 4             .916           .839
 5             .943           .891
 6             .960           .923
 7             .970           .941
 8             .977           .955
 9             .982           .964
10             .985           .970
11             .988           .976
12             .990           .980
13             .991           .983
14             .992           .985
15             .994           .987

The correction factors are used in the following manner. Suppose we have
an obtained r of .61 in a problem with 8 intervals in X and 9 in Y. The
correction factors for these numbers of intervals are .977 and .982,
respectively. The correction is made by dividing the obtained r by the
product of the two correction factors. In terms of a formula,

$$r_c = \frac{r}{c_x c_y}$$
(Coefficient of correlation corrected for coarse grouping) (13.35)

in which c_x and c_y are the correction factors for variables X and Y,
respectively, based upon the number of class intervals in each. Applied to
the correlation of .61 with 8 and 9 categories in X and Y,

$$r_c = \frac{.61}{(.977)(.982)} = .636 \text{ (or .64)}$$

When there are the same number of intervals in both X and Y, the
correction factor is the same for both and the factor squared would be
called for in the denominator of formula (13.35). The factors squared are
given for this purpose in Table 13.15.

1 Peters and Van Voorhis, op. cit., p. 398.
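The correction can be sketched as a table lookup followed by one division.
The factors below are those of Table 13.15 as best they can be read, and
the names are introduced only for illustration. Carried out directly, the
division .61 / (.977 × .982) comes to about .636.

```python
# Correction factors as read from Table 13.15 (number of intervals -> factor).
CORRECTION_FACTORS = {
    2: .816, 3: .859, 4: .916, 5: .943, 6: .960, 7: .970, 8: .977,
    9: .982, 10: .985, 11: .988, 12: .990, 13: .991, 14: .992, 15: .994,
}


def correct_for_grouping(r, intervals_x, intervals_y):
    """Formula (13.35): divide r by the product of the two correction factors."""
    c_x = CORRECTION_FACTORS[intervals_x]
    c_y = CORRECTION_FACTORS[intervals_y]
    return r / (c_x * c_y)


r_c = correct_for_grouping(.61, 8, 9)
print(round(r_c, 3))
```

With the same number of intervals both ways, the product in the denominator
is simply the squared factor of the table, apart from rounding in the
printed values.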

When the number of intervals in either X or Y is less than 10 it is 
good practice to apply this correction procedure; certainly when the num- 
ber of intervals is 8 or below. There is most to be gained in accuracy of 
estimate of r when the obtained r is large; little to be gained if r is small,
particularly if the sample is small. When the corrective change is small
compared with the size of the standard error of r, there is little use in
making the correction. It should be remembered that the correction
factors given in Table 13.15 are designed especially for the situation in
which the midpoint of an interval is the index number for cases in that
interval, the intervals are equal in size, and the distributions are normal.
For other, less common situations, see the reference below.1

Correction of φ for Coarse Grouping.—Since the phi coefficient is a
product-moment estimate of correlation, the question arises as to whether
it is ever subject to this kind of correction. This question should arise
only when one or both variables are actually continuously measurable
and we want a more realistic estimate of correlation that describes the
relationship that exists when the variable is used in graded form. As to
number of “intervals” we have two each way when φ is computed. The
index number for each interval is not the midpoint, however, but is the
mean of the cases in the interval. If we can assume actual normal
distribution for the variable so correlated, the correction factor is .798.
The square of this, which we would use if both variables are normally
distributed, is .637. The use of the correction factor .798 would estimate
a point-biserial r from φ. The use of the squared factor .637 would
estimate a Pearson r from φ.
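A minimal sketch of this correction, using the two constants just given
and a function name introduced only for illustration, also flags the case
in which the corrected value exceeds 1.0 and the assumption of continuous,
normal distributions is therefore in doubt.

```python
def correct_phi(phi, both_continuous=True):
    """Divide phi by .637 (both variables continuous) or .798 (one only).

    Returns the corrected estimate and a flag that is True when the
    estimate exceeds 1.0 in absolute value.
    """
    factor = 0.637 if both_continuous else 0.798
    estimate = phi / factor
    return estimate, abs(estimate) > 1.0


# Illustrative phi values, not taken from any data in the text.
print(correct_phi(0.40))                          # estimate of a Pearson r
print(correct_phi(0.40, both_continuous=False))   # estimate of a point-biserial r
print(correct_phi(0.70))                          # exceeds 1.0: assumptions doubtful
```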

An important reservation, however, should be added. Remembering
the severe restriction in the size of φ, it is probably correct to say that
unless p = p’, or p = q’, in which case only can a maximal φ equal 1.0,
a corrected phi will not be equivalent to a Pearson r. It is well to
remember, too, that some obtained φ’s are greater than .798 or .637, which
would mean corrected coefficients greater than 1.0. In such events, it is
probable that the assumptions of continuous, normal distributions are
unjustified; the distributions probably are bimodal, if not genuine point
distributions. On the whole, the application of correction of φ for coarse
grouping is so limited and questionable that it is best to think twice
before applying it.


1 Peters and Van Voorhis, op. cit., p. 398.




Exercises 


1. Compute by the rank-difference method the correlation between the first 20
scores in any two variables in Data 8A. Find the standard error of rho. Interpret
your results.

2. Compute for Data 15A a correlation ratio for the prediction of Y from X. Find
the standard error of eta and the standard error of the estimate. Apply the chi-square
test of linearity. Interpret your results.

3. Find from the literature three applications of the correlation ratio. State how
the author used eta, and give his reasons, if stated. What subsidiary tests (of linearity,
etc.) were made? Make your judgment as to the effectiveness of the uses of eta in the
cases cited.

4. If you have mastered the analysis-of-variance procedures as described in Ch. 10,
make the application as suggested in this chapter to Data 15A, following your solution
of the correlation ratio.

5. In the data in Table 13.7, combine the distributions receiving marks of A, B, or C,
into a single composite; also, in another composite, combine those receiving marks of
D and F. Compute for these data a biserial r between scores and marks. Find the
standard error of r_b. Interpret your results.

6. Compute a tetrachoric coefficient of correlation for Data 14A. Determine
whether or not the correlation is probably significant. If the Thurstone computing
diagrams are available, check your solution by this means.

7. Cite some fourfold tables found in this book to which the tetrachoric correlation
method should be applied, and cite some others to which it should not be applied.

8. Reduce to a fourfold table preparatory to computing a tetrachoric r the scatter
diagram given in Data 15A. Do the same for Data 11A and Data 8B.

9. Find in this volume, or any other source, data to which the phi-coefficient method
of correlation may properly be applied. Give reasons.

10. Compute a phi coefficient for Data 11A, and make the necessary correction to
yield an estimate of the Pearson r. If Exercise 8 has been completed, compare with
the r_t found there.

11. Find in the literature examples of coefficients of correlation that might be regarded
as spurious from some points of view. How did the author interpret them? How
would you interpret them?

12. Apply the correction-for-grouping process to some product-moment coefficient
you have obtained or to one you find uncorrected in the literature.

13. The Pearson r for the data in Table 13.7 is .74. Correct it for errors of grouping.
How does the change in the corrected r compare with σ_r? How do the uncorrected and
corrected Pearson r’s compare with the tetrachoric r given for the same data?

14. Determine the following partial r’s for Data 16A : rs4., 741.2, 721.5) 51.2. Interpret 
your results. Which of these coefficients have little practical meaning? 

15. Determine the following partial r’s for Data 16B: r319, rs1.4, Tardy 146.2) 41.24 
Interpret your results. Suggest other partial r’s that might be of importance to know
about, and tell why. 


CHAPTER 14 
PREDICTION OF ATTRIBUTES 


One of the most important fruits of scientific investigation and one of
the most exacting tests of any hypothesis is the ability to make
predictions. So important is this topic that it deserves to have
considerable space devoted to it. Particularly is this true for the reason
that statistical
reasoning is basic to all predictions. Statistical ideas not only guide us in 
framing statements of a predictive nature but also enable us to say some- 
thing definite concerning how trustworthy our predictions are—about how 
much error one should expect in the phenomenon predicted. The practi- 
cal significance of this cannot be questioned. The significance even for 
the scientific investigator is too often unrecognized or forgotten. 

One can find amateur prognosticators for almost any kind of event 
on every hand. Little note is made of the success or failure of their 
predictions. A few successes are sufficient basis for vindication of the 
“prophet,” and many failures are quickly forgiven and forgotten. The 
old adage “Where ignorance is bliss ’tis folly to be wise” must have 
been invented to fit this particular situation. On the other hand, the 
psychologist or educator who falls short of perfect predictions is often 
immediately condemned and his further predictions thought to be dis- 
credited. The average uninformed person is somehow partial to vague 
and “magical” means of prediction, and he can readily overlook their 
shortcomings, whereas he will not tolerate the statistically hedged pre- 
diction that also yields to him a more exact knowledge of its limitations. 
If he could only realize how poor the predictions of the amateur prophet
actually are, he would perhaps have a more ready respect for the scientific 
prediction of events in human affairs. It is the purpose of this chapter, 
and the next two, to illustrate the kinds of predictions the statistically 
oriented investigator makes and how he not only does not blind his eyes to 
his failures but brings them clearly into the light. 

General Types of Prediction.—Although in this volume we have gener- 
ally emphasized measurement, we have had to recognize from time to 
time that complete measurements cannot be made and that data are 
sometimes obtained as merely classified in categories. The latter type of 


data we recognize as enumeration data rather than as measurements. It 
. 363 


364 FUNDAMENTAL STATISTICS IN PSYCHOLOGY AND EDUCATION 


is a matter of assigning attributes to cases rather than quantitative evalu- 
ations on a linear scale, for example, identifying individuals as to sex, race, 
political party, or criminality. Although such data are not allocated to 
linear-scale positions, we can still make predictions from them and of them 
from other information. We thus have four cases of predicting: 

1. Attributes from other attributes—as when we predict incidence of
criminality from sex, race, or religious creed.

2. Attributes from quantitative measurements—as when we predict
criminality from scores on tests of ability or of behavior traits.

3. Measurements from attributes—as when we predict probable test
scores from sex, socioeconomic status, or marital status.

4. Measurements from other measurements—as when we predict
achievement in school from IQ-test scores.

General Ways of Evaluating Accuracy of Prediction.—Predictions are 
obviously sound if they prove to be correct. The degree of correctness is 
indicated by how often or how nearly we hit the mark. In the case of 
predicting attributes, our success can be numerically indicated in terms 
of the percentages of “hits.” But a more accepted way among statis- 
ticians is to ask how much better our predictions are than if we had not 
used the information we have—in other words, if we had not tried to pre- 
dict one thing from the knowledge of another but merely from a knowledge 
of the predicted population itself. A more crude way of saying it would 
be to ask how much better our predictions are than guesswork. But this 
does not mean pure guesswork, as we shall see later. 

In predicting measurements, whether from attributes or from other 
measurements, we ask a similar question. But whereas in predicting attributes 
for cases, we work in terms of the number of hits or misses, in predicting 
measurements, we work in terms of how far on the average we have 
missed the mark. We compare this average deviation between fact and 
prediction with the average of the errors we should make without using the 
knowledge we did as a basis of prediction. 

Let us see in a preliminary way what this means. We can predict that 
a student’s mark in a course will be somewhere in the range from A to F 
inclusive, and most probably it will be a mark of C, which more students 
earn than any other mark. This prediction is made without knowledge 
of the student’s scholastic-aptitude score, and its margin of error is meas- 
urable in terms of the standard deviation of the distribution of marks of 
all students. If we used knowledge of the students provided by aptitude- 
test scores, we should predict some to earn marks higher than C and some 
lower than C. The average of our deviations between prediction and fact 
will now be smaller than the standard deviation of the distribution of all 
marks. The difference between these averages of deviations tells us how 
much the knowledge of aptitude scores has improved our predictions. 


PREDICTING ATTRIBUTES FROM OTHER ATTRIBUTES 


Predictions Can Be Made in Both Directions.—As our first example of 
prediction of attributes from other attributes, let us consider the data in 
Table 14.1. Here we have the numbers of persons in a “depressed” 
group who responded by saying “Yes,” “?” and “No” to the question, 
“Would you rate yourself as an impulsive individual?” and also the 
numbers of a group described as “not depressed.” The individuals in 
these two categories are the highest and lowest quarters of a sample of 
1,000 students who were ranked in terms of a provisional scoring on a 
personality inventory. Table 14.1 provides us with two prediction problems. 
We can attempt to predict the verbal response to the question, 
knowing whether the person is in the depressed or not-depressed group; 
or we can attempt to predict the group to which a person belongs, knowing 
what response he has made. Let us take the prediction of verbal 
response first. 

TABLE 14.1.—DISTRIBUTION OF RESPONSES TO THE QUESTION, “WOULD YOU RATE 
YOURSELF AS AN IMPULSIVE INDIVIDUAL?” AS GIVEN BY TWO EXTREME 
GROUPS OF STUDENTS 

                           Response 
Group              Yes      ?      No    Total 
Depressed           72     45     133      250 
Not depressed      106     35     109      250 

The Principle of Maximum Likelihood.—Considering first the depressed 
group by itself, we find that the largest number of them respond with 
“No.” Taking each member of the depressed group as he came along, 
we should predict for him the response “No.” If all 250 came up for 
inspection, we should be correct 133 times out of 250, or 53.2 per cent of 
the time. For other samples from the same depressed population, we 
should expect a similar ratio of correct predictions. This illustration sets 
the pattern for all predictions of attributes from attributes. The predic- 
tion always observes the mode or most frequent attribute in the segment 
of the population chosen at the moment. For the not-depressed group, 
the mode is also at the response “No”; hence that is our prediction also 
for them, and our percentage of accuracy is 43.6 per cent, not so high as 
before but higher than if we had predicted either “Yes” or “?” for this 
group. Such predictions follow the principle of maximum likelihood or 
maximum probability. Either a depressed or a not-depressed person in 
this population is more likely to respond “No” than anything else; so 
that is our prediction. 
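
The rule just described can be sketched in a few lines of Python. This is only an illustration, not part of the original procedure: the variable and function names are invented here, and the frequencies for the not-depressed group (106, 35, 109) are the ones quoted later in the chapter.

```python
# Observed frequencies of response to the "impulsive individual" question.
table = {
    "depressed":     {"Yes": 72,  "?": 45, "No": 133},
    "not depressed": {"Yes": 106, "?": 35, "No": 109},
}

def modal_prediction(freqs):
    """Return (most frequent category, proportion of cases it covers)."""
    category = max(freqs, key=freqs.get)
    return category, freqs[category] / sum(freqs.values())

# Predict the modal response separately within each group.
predictions = {group: modal_prediction(freqs) for group, freqs in table.items()}

for group, (category, share) in predictions.items():
    print(f"{group}: predict {category!r}, correct {100 * share:.1f} per cent")
```

For both groups the modal response is “No,” with 53.2 and 43.6 per cent of predictions correct, as stated in the text.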

The Forecasting Efficiency in Predicting Attributes.—How good are 
these predictions? Since we have predicted the same response for both 
depressed and not-depressed individuals, we suspect that knowing to 
which group the person belongs helps us little if any to predict his response. 
A comparison of the percentages of correct predictions, however, tells us 
that we can be more sure of our prediction of “No” if the person is 
depressed than if he is not. But no matter from what group the person 
comes, our prediction is the same; so it is as if we could make no use of 
the knowledge of his group affiliation for this purpose. 

Let us compare the number of successes of prediction made with and 
without knowledge of group affiliation. Taking both groups combined, 
we should predict for each person at random the response “No,” and we 
should be correct 242 times in 500, or 48.4 per cent. In the two groups 
predicted separately, we found successes of 133 and 109, which combined 
give us 242 correct hits, or 48.4 per cent. We have thus gained no more 
accuracy in predicting responses from a knowledge of group affiliation 
than we could attain without this knowledge. The forecasting efficiency 
in predicting response from knowledge of group is therefore just zero. 
The work of calculating forecasting efficiency may be seen more clearly 
if summarized as in Table 14.2. 


TABLE 14.2.—PREDICTIONS OF RESPONSE FROM KNOWLEDGE OF THE GROUP 
MEMBERSHIP 

Group membership             Predicted    Number    Per cent 
                             response     correct   correct 
Depressed                       No          133       53.2 
Not depressed                   No          109       43.6 
Both groups                                 242       48.4 
Correct without knowledge                   242       48.4 
Excess with knowledge                         0        0.0 


The second prediction problem here is to reverse matters and predict 
group membership from knowledge of the response. All persons responding 
“Yes” we should predict to be members of the not-depressed group, 
since 106 actually are, as compared with 72 who are not. Again the 
modal attribute is our prediction. For those responding “?” the pre- 
diction is membership in the depressed group, and so also for those 
responding “No.” The percentages of correct predictions are given in 
Table 14.3 for each response and for all combined. Altogether, there are 
284 correct predictions, or 56.8 per cent. Without knowledge of which 
response each person made to the question, but with knowledge that half 
the total population are depressed and half are not, our expected number 
of chance successes is 250. Our predictions with knowledge of responses 
yielded an excess of 34 or a forecasting efficiency of 13.6 per cent. We 
can say that our predictions with knowledge of response to the question are 
13.6 per cent better than those made without this knowledge would be. 

TABLE 14.3.—PREDICTIONS OF GROUP MEMBERSHIP FROM KNOWLEDGE OF VERBAL 
RESPONSE TO THE QUESTION 

Response                     Predicted group    Number    Per cent 
                                                correct   correct 
Yes                          Not depressed        106       59.6 
?                            Depressed             45       56.3 
No                           Depressed            133       55.0 
All responses                                     284       56.8 
Correct without knowledge                         250       50.0 
Excess with knowledge                              34       13.6 

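
The bookkeeping for both directions of prediction can be sketched as follows. The function name is invented for illustration, and the forecasting efficiency is taken, as in the text, to be the excess of correct predictions expressed as a percentage of the number correct without knowledge (34 in 250).

```python
# Observed frequencies. Rows: depressed, not depressed. Columns: "Yes", "?", "No".
by_group = [[72, 45, 133], [106, 35, 109]]
# Transposed table. Rows: "Yes", "?", "No". Columns: depressed, not depressed.
by_response = [list(col) for col in zip(*by_group)]

def forecasting_efficiency(rows):
    """Predict the column category from the row category by the modal rule.

    Returns (hits with knowledge, hits without knowledge, excess,
    excess as a percentage of the hits without knowledge).
    """
    hits_with = sum(max(row) for row in rows)           # mode within each row
    hits_without = max(sum(col) for col in zip(*rows))  # mode of the whole sample
    excess = hits_with - hits_without
    return hits_with, hits_without, excess, 100 * excess / hits_without

response_from_group = forecasting_efficiency(by_group)
group_from_response = forecasting_efficiency(by_response)
print(response_from_group)
print(group_from_response)
```

The first call reproduces the bottom rows of Table 14.2 (242 correct either way, no excess); the second reproduces Table 14.3 (284 against 250, an excess of 34).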
Prediction Not Equally Good in the Two Directions.—It is now well 
apparent that we can predict successfully group membership from knowledge 
of responses in this problem, whereas we cannot predict response 
from knowledge of group membership. It is not always true, as it is 
here, that successful prediction is possible in one direction and entirely 
impossible in the other, but it is a quite common finding that prediction 
is better in one direction than in the other when two variables are concerned. 
It will often clarify thinking about predictive problems to keep 
this fact in mind. It is sometimes assumed by the uninformed that if 
A can be predicted from B, B can, in turn, be predicted from A. Such 
an assumption is likely to lead the unwary investigator into logical and 
practical difficulties when it is seriously wanting in applicability. This is 
a more serious matter in dealing with attributes than in dealing with 
measurements, for in the latter case the predictability of one measured 
trait A from a measured trait B is usually not very divergent from the 
predictability of B from A. 


The Sampling Procedure in Prediction of Attributes.—The evaluations 
of predictions already given are meaningful and useful. There is still the 
problem of how significant the decisions based upon the sample may be 
for the population. This calls for application of sampling statistics. For 
this purpose we can adapt the use of chi-square, t, and φ tests, all of which 
have been previously described. Their application here contains some 
new features that need to be explained. 

The Cell Square Contingency Method.—We can compute a chi square 
for the entire contingency table involved in the prediction problem, and 
that would be meaningful as an over-all index of significance of predictive 
value somewhere among the categories. As we saw in the previous 
examination of predictions, however, some predictions are apparently 
better than others within the same table. By breaking chi square down 
into components, or rather, by examining the contributions to chi square 
from the different categories, we obtain a more analytical picture of each 
one’s significant contribution to prediction. Table 14.4 shows the customary 
steps in the solution of chi square. The last segment of the 
table, in which are given the cell square contingencies, is particularly to 
be noted. 

TABLE 14.4.—DEMONSTRATION OF THE CELL SQUARE CONTINGENCY METHOD OF 
TESTING CONTRIBUTIONS TO PREDICTION 

                   Expected frequencies (fe)        Discrepancies (fo − fe) 
Group              Yes      ?     No    Total       Yes      ?     No    Total 
Depressed           89     40    121      250       −17     +5    +12        0 
Not depressed       89     40    121      250       +17     −5    −12        0 
Both               178     80    242      500         0      0      0        0 

                   Squared discrepancies        Cell square contingencies 
                   (fo − fe)²                   (fo − fe)²/fe 
Group              Yes      ?     No            Yes        ?       No     Total 
Depressed          289     25    144          3.247    0.625    1.190     5.062 
Not depressed      289     25    144          3.247    0.625    1.190     5.062 
Both                                          6.494    1.250    2.380    10.124 

The chi square for the entire table is equal to 10.12, which, with 2 
degrees of freedom, is significant just beyond the 1 per cent level. We 
next examine each column of the table, for the sum of the cell square 
contingencies for that column (the column square contingency) indicates 
the degree of significance to be attached to the category it represents. For 
the response “Yes,” the sum is 6.49. This may be regarded as a chi square 
for a two-cell table and tests the hypothesis that the depressed and the 
not-depressed groups should have responded “Yes” in equal frequencies 
to the question. With one degree of freedom, the departure from the 
hypothesis is significant almost at the 1 per cent level of confidence. The 
square root of chi square with one degree of freedom is equal to t, hence t 
for this response is 2.55. For the other responses, “?” and “No,” the t 
values are 1.12 and 1.54, both insignificant. Thus, we have a decision as 
to the sampling stability of the gains in accuracy of prediction as given in 
percentage terms in Table 14.3. Those percentages are 59.6, 56.3, and 
55.0 for the three responses, respectively. Only the first seems significant. 

As for the prediction of response from knowledge of group member- 
ship, the answer lies in the sums of the rows of cell square contingencies 
in Table 14.4. These sums are the same: 5.06. With 2 degrees of free- 
dom, they fail to be significant at the 5 per cent level. This outcome 
agrees with the decision based upon Table 14.2, where it was found that 
there were no excess correct predictions attributable to knowledge of group 
membership, depressed versus not-depressed. More accurately inter- 
preted, the row sums indicate that the distribution of responses of 250 
depressed individuals does not differ significantly from that of the 500 
depressed and not-depressed combined. The same may be said for the 
not-depressed group. When both are considered together, however, their 
mutual departure from a common, hypothetical distribution (that of the 
500 combined) is sufficient to yield a chi square of 10.12, which is signifi- 
cant. The corresponding coefficient of contingency (C) equals .14, which 
is another index of over-all predictive value. Because the chi square from 
which C was derived is significant at the 1 per cent level, so is C signifi- 
cantly different from zero correlation. 
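
The steps of Table 14.4 can be retraced with a short Python sketch. The names are illustrative only. The coefficient of contingency is computed here by the usual formula C = √[χ²/(N + χ²)], which is assumed to be the one intended.

```python
import math

responses = ["Yes", "?", "No"]
observed = {"depressed": [72, 45, 133], "not depressed": [106, 35, 109]}

n = sum(sum(row) for row in observed.values())
col_totals = [sum(col) for col in zip(*observed.values())]

# Cell square contingencies: (fo - fe)^2 / fe for every cell.
cell_sq = {}
for group, row in observed.items():
    expected = [sum(row) * c / n for c in col_totals]
    cell_sq[group] = [(fo - fe) ** 2 / fe for fo, fe in zip(row, expected)]

# Column sums test each response; row sums test each group.
column_sq = [sum(col) for col in zip(*cell_sq.values())]
row_sq = {group: sum(cells) for group, cells in cell_sq.items()}

chi_square = sum(column_sq)
contingency_c = math.sqrt(chi_square / (n + chi_square))
# Square root of a one-degree-of-freedom chi square for each response.
t_values = [math.sqrt(v) for v in column_sq]

for name, col, t in zip(responses, column_sq, t_values):
    print(f"{name}: column square contingency {col:.3f}, t = {t:.2f}")
print(f"chi square = {chi_square:.2f}, C = {contingency_c:.2f}")
```

The column sums come out as 6.49, 1.25, and 2.38, each row sum as 5.06, and the total as 10.12, in agreement with the table.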

Response Significance as Indicated by Phi.—Another approach, which 
applies in the special situation in which one of two categories is to be 
predicted from knowledge of another variable in more than two categories, 
uses the phi coefficient. Here we are interested only in the prediction 
of depressed versus not-depressed group membership from knowledge 
of response to a question. This approach is virtually an example of 
the more general application of φ to the validation of responses to test 
items (see Ch. 17). When there are only two alternative responses to an 
item, the predictive value of the one response equals that of the other 
response, for lack of more than one degree of freedom. A φ coefficient 
would be quite suitable to indicate the correlation of each response to the 
item with a two-category criterion. When there are more than two 
responses, as in the present illustration, we can validate each response 
separately, although it is, to be sure, just one item, because there is more 
than one degree of freedom. The validity of any one response, or its 
correlation with the criterion, does not automatically determine the validi- 
ties of the others, though, of course, it will have some bearing upon that 
validity. 

The procedure is demonstrated in Table 14.5. There we have three 
different 2 × 2 contingency tables, one for determining the φ for each 
response. When validating one response we group the others into one 
category. The two categories when validating response “Yes” are 
responses “Yes” and “Not yes,” and so on. The φ’s for the three 
responses are .142, .055, and .095, respectively. This is another basis 
of comparing the effectiveness of the three responses as discriminating 
between depressed and not-depressed groups. We cannot be very sure 
that the differences in size of φ’s are significant, since we do not have 
standard errors of the φ’s. We can test the hypothesis of zero correlation, 
however, by means of the chi squares, which are 10.08, 1.49, and 4.61, 
respectively. These are to be interpreted as very significant, insignificant, 
and significant, for responses “Yes,” “?” and “No,” respectively. 
These chi squares come in the same rank order as the column square 
contingencies (see Table 14.4) but they are somewhat larger than the latter. 
The differences are to be attributed to a difference in operations. The 
sum of the three chi squares (10.08 + 1.49 + 4.61) obviously exceeds the 
sum of the three column square contingencies because each column is 
included more than once in the three 2 × 2 tables. There is a difference 
in meaning, also. In computing the phi coefficients, we have asked, 
“What is the predictive value of a selected response versus all other 
responses?” If we predict one group membership in this problem from 
the responses “Yes,” we automatically predict the other group membership 
for all other responses. We find that it paid to group responses “?” 
and “No” together but it definitely was not so profitable to group any 
other pairs of responses. The function of the “?” response was much 
the same as that of the “No” response. This could have been seen in 
the original table (Table 14.1), in which the directions of differences in 
frequencies were apparent. It was also apparent in that the same prediction 
was made from the two responses. The tests of sampling significance 
bear out those observations. We would obtain as much predictive 
value by treating responses “?” and “No” as if they were identical as 
we would by giving them individual weighting, as shown by the fact that 
when we combine them the chi square (10.08) is about the same as for 
the entire contingency table (10.12) when the two responses are kept 
separate. This is also shown by the fact of insignificant φ for the fourfold 
table featuring the “?” response in Table 14.5. 

TABLE 14.5.—TESTING THE BASIS OF PREDICTION PROVIDED BY EACH CATEGORY 
SEPARATELY BY MEANS OF CHI SQUARE AND PHI 

                        Response                 Response                 Response 
Group             Yes   ? + No   Total      ?   Yes + No   Total     No   Yes + ?   Total 
Depressed          72     178     250      45     205       250     133     117      250 
Not depressed     106     144     250      35     215       250     109     141      250 
Both              178     322     500      80     420       500     242     258      500 

                  χ² = 10.08, φ = .142     χ² = 1.49, φ = .055      χ² = 4.61, φ = .095 
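
As a check on the fourfold computations, the following Python sketch collapses the 2 × 3 table into each of the three 2 × 2 tables and computes chi square and φ. It assumes the usual shortcut formula for a fourfold table, χ² = N(ad − bc)²/[(a + b)(c + d)(a + c)(b + d)], with φ = √(χ²/N) and no correction for continuity; the function and variable names are illustrative.

```python
import math

responses = ["Yes", "?", "No"]
depressed = [72, 45, 133]
not_depressed = [106, 35, 109]

def chi_square_2x2(a, b, c, d):
    """Chi square for the fourfold table [[a, b], [c, d]], uncorrected."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))

chi_squares, phis = {}, {}
for i, name in enumerate(responses):
    a = depressed[i]              # depressed, selected response
    b = sum(depressed) - a        # depressed, all other responses
    c = not_depressed[i]          # not depressed, selected response
    d = sum(not_depressed) - c    # not depressed, all other responses
    chi_squares[name] = chi_square_2x2(a, b, c, d)
    phis[name] = math.sqrt(chi_squares[name] / (a + b + c + d))

for name in responses:
    print(f"{name}: chi square = {chi_squares[name]:.2f}, phi = {phis[name]:.3f}")
```

Under these assumptions the chi squares agree with Table 14.5 (10.08, 1.49, 4.61). The φ for “No” comes out nearer .096 than the tabled .095, a difference in the third decimal that is presumably one of rounding.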


PREDICTING ATTRIBUTES FROM MEASUREMENTS 


We sometimes wish to decide on the basis of known measurements 
whether an individual should be expected to be in one category, e.g., to 
have a certain attribute, or whether he should be expected to be in another. 
Sometimes it is a matter of making placements in different categories in 
order that the individual may expect a better consequent adjustment or 
greater satisfaction. Such is the case when we attempt to predict success 
or failure for persons for whom we know certain test scores. This problem 
was solved in principle by Guttmann.¹ Here the author will attempt 
to provide some workable procedures whereby such predictions can be 
made and their relative accuracy determined. 

¹ The prediction of personal adjustment. New York: Social Science Research 
Council, 1941. Pp. 271f. 

Critical Points Dividing Distributions.—In Fig. 14.1, we have two 
populations, differing in mean, standard deviation, and in N. We wish 
to find a score on the scale of measurement that will give us the maximum 
accuracy of prediction, so that we may say of an individual whose 
score is higher than that point that he is probably a member of the upper 
group and of an individual whose score is lower than that point that he is 
probably in the lower group and in so predicting, make the minimum 
number of mistakes. Let us call that critical point E. 

According to Guttmann’s solution, point E comes on the scale where 
the two distributions have equal ordinates—in other words, where the 
two curves intersect (see Fig. 14.1). At this point, persons with scores of 
this value are equally likely to be members of either group. Above this 
point, at any score there is greater likelihood that the person belongs in 
the upper group than that he belongs in the lower group. Below this 
point, at any score, there is a greater likelihood that the person belongs in 
the lower group. The terms upper and lower here apply only to relative 
position on the measuring scale. The two distributions are divided 
according to two qualities or attributes, and it is possession of those 
attributes that we are trying to predict. As we proceed along above 
point E, the probability that we are correct in our prediction increases, 
since the ratio of individuals having attribute B to the number having 
attribute A keeps increasing. At point B, which is the upper limit of 
the range of the A group, and above B, we should have absolute certainty 
of prediction so far as these particular populations are concerned. Likewise, 
below point A, where the upper distribution ends, we should be 
absolutely certain that no case possesses attribute B. But if the two 
populations are taken as wholes, the shaded portions stand for the proportions 
of individuals incorrectly predicted. The cross-hatched section 
(of distribution A) represents the A’s wrongly predicted to be B’s, and 
the stippled section (of distribution B) represents the B’s wrongly predicted 
to be A’s. All the B’s above point E are correctly predicted. It 
is on the basis of these numbers of correctly and incorrectly predicted 
cases that we can judge the forecasting efficiency, as we shall see later. 
First, let us see how point E can be determined. 

[Figure: two overlapping frequency curves, labeled “Distribution for Attribute A” and “Distribution for Attribute B.”] 
FIG. 14.1.—Distribution of two hypothetical groups possessing two distinguished 
attributes, A and B, when measured on the same scale of some other variable. The aim is 
to predict for each person his attribute from knowledge of his score. For those with 
scores above point E we predict attribute B as being more likely; for those below E, we 
predict attribute A. 
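
The equal-ordinate rule can be made concrete with a small numerical sketch. Everything in it is hypothetical: the two populations are assumed to be normal, their means, standard deviations, and numbers are invented for illustration, and the crossing point is found by simple bisection between the two means. With unequal standard deviations the curves also cross a second time far out in one tail, which is ignored here.

```python
from statistics import NormalDist

# Hypothetical populations (not from the text): A is the lower group.
n_a, dist_a = 300, NormalDist(mu=50, sigma=10)
n_b, dist_b = 200, NormalDist(mu=70, sigma=12)

def ordinate_gap(x):
    """Frequency-curve ordinate of B minus that of A at score x."""
    return n_b * dist_b.pdf(x) - n_a * dist_a.pdf(x)

def critical_point(lo, hi, tol=1e-9):
    """Bisection for the score between lo and hi where the two curves cross."""
    assert ordinate_gap(lo) < 0 < ordinate_gap(hi)
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if ordinate_gap(mid) < 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

E = critical_point(dist_a.mean, dist_b.mean)

# Expected numbers of wrong predictions when E is used as the dividing score.
a_called_b = n_a * (1 - dist_a.cdf(E))   # A's above E
b_called_a = n_b * dist_b.cdf(E)         # B's below E
print(f"E = {E:.1f}; A's called B: {a_called_b:.1f}; B's called A: {b_called_a:.1f}")
```

Weighting each density by its N matters: the curves of Fig. 14.1 are frequency curves, so the larger group pulls the critical point toward the smaller one.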

Locating a Critical Point for an Artificial Dichotomy.—The principle 
upon which the point of division is made on the continuous variable is a 
variation of the principle of maximum likelihood. For scores above the 
critical value, the probability of a case being in the upper category is 
greater than .5. For scores below the critical value, the probability of a 
case being in the upper category is less than .5. 

The location of the critical division point depends to some extent upon 
whether the dichotomy is a genuine one or whether it is an artificial one 
based upon continuous measurements. There are several methods that 
can be used to solve the problem. Some apply to either kind of dichot- 
omy, some to one or the other but not to both. We will begin with 
methods that apply to the artificial dichotomy. 

As illustrative material, let us use the data in Table 14.6. A large 
group of students were given the same comprehensive final examination 
in freshman English. Each instructor was at liberty to use the scores in 
this examination along with other measurements as he saw fit in deriving 
a final mark in the course for his students. Taking all marks collectively, 
for all students receiving a mark of F, a frequency distribution of their 
examination scores was set up. The same was done for students receiving 
marks of D, C, B, and A. These are the five distributions listed in 
Table 14.6 and shown graphically in Fig. 14.2. The amount of over- 
lapping in ability as represented by examination scores among these five 
groups is noteworthy, but it probably represents a not unusual situation 
where marks are determined in the customary manner. However that 
may be, let us say that students receiving F’s are, in the judgment of the 
teachers, failing students, and those receiving D’s are D students, etc. 
These five categories represent five attributes as judged by these instruc- 
tors. Let us take as our problem the task of predicting what attribute 
will be assigned to students making certain scores in the examination. 

Graphic Methods of Locating the Critical Point.—When the overlapping 
distributions are plotted as in Fig. 14.2, if they are fairly regular in con- 
tour, one can immediately locate the points at which the two distributions 
intersect. Distributions for attributes F and D intersect just below a 
score of 60; more exactly, by inspection, at 57 or 58. In this approach, 
it would be well to locate the point between two whole numbers, because 
scores are obtained in whole numbers. In this case, we should predict 
an F for students making a score of 57 or lower, and a mark of D for 
those making a score of 58 or above (at least up to the critical point 
between D and C). Between D and C, the critical point, by inspection, 
seems to be at about 87, probably on the lower side. Thus, for scores 
58 through 86, we should predict a mark of D. The next critical point 
seems to come between 124 and 125. The prediction of a C arises for 
scores 87 through 124. The critical point between B and A is almost 
impossible to determine but seems to lie in the region of 170 to 175. The 
small number of A’s makes any solution of this kind uncertain. 

TABLE 14.6.—DISTRIBUTIONS OF SCORES IN A GENERAL ENGLISH EXAMINATION MADE 
BY STUDENTS RECEIVING VARIOUS MARKS IN THE COURSE 

Scores          A      B      C      D      F 
180-189         1 
170-179         1      1 
160-169         5      7      1 
150-159         7     13      3 
140-149         2     26     10      1 
130-139         2     34     24      5      1 
120-129         0     40     39      7      0 
110-119         1     21     81     13      3 
100-109               19     89     28      4 
 90- 99                4     81     29      9 
 80- 89                1     42     46      8 
 70- 79                      16     29     11 
 60- 69                       5     20      9 
 50- 59                              6     11 
 40- 49                              1      5 
 30- 39                                     3 
 20- 29                                     0 
 10- 19                                     0 
  0-  9                                     1 
Sums           19    166    391    185     65 

Should overlapping distributions be irregular in contour, particularly 
in the neighborhood of the intersection point, if the data are not too 
limited, and if the smoothing required is rather obvious, it would be 
well to resort to smoothing before the point of intersection is sought 
(see Ch. 3 for a description of smoothing procedures). 

This graphic method of determining a critical dividing score point may 
do for rough estimates when samples are large and contours of distribu- 
tion curves are regular. A better graphic procedure will be described 
next. It is not only rather useful in practical situations but demonstrates 
a more general conception of the prediction problem. 

Preparatory to the application of this method, the frequency distribu- 
tions of Table 14.6 were combined in various ways as shown in Table 14.7. 
In this method we are interested in finding out from the data the proba- 
bility that an individual who earned a score of a certain size will be in 
the upper of two groups. In column (1) we have the total composite 
distribution. In column (2) we have the distribution of only those who 
received a mark of A. The probability of a student in any class interval 
on the examination receiving a mark of A is indicated by the proportion 
of all those in that interval who actually did receive a mark of A. This is 
an empirical probability, derived from the sample data. We use it as an 
estimate of the population probability. Not until we go down the column 
of frequencies in column (2) to the interval 160-169 do we find frequencies 
of a size that would give us much confidence in the accuracy of the pro- 
portion derived from them. In that interval, 5 out of 13 received an A, 
or a proportion of .38. In the interval 150-159, 7 out of 23, or 30 per 
cent, received an A. The other columns of the table represent other 
division points as to upper and lower marking categories. In columns 
(6) and (7) we are interested in the proportions in the class intervals 
receiving a mark of C or above. 
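
The empirical proportions described above can be computed directly from the frequencies of Table 14.6, as in the following sketch. The names are illustrative. Blank cells are entered as zero, and the F frequencies for the intervals 60-69 and 40-49 are taken as the differences between the interval totals of Table 14.7 and the other marks.

```python
# Frequencies by score interval: lower limit -> (A, B, C, D, F).
marks_by_interval = {
    180: (1, 0, 0, 0, 0),
    170: (1, 1, 0, 0, 0),
    160: (5, 7, 1, 0, 0),
    150: (7, 13, 3, 0, 0),
    140: (2, 26, 10, 1, 0),
    130: (2, 34, 24, 5, 1),
    120: (0, 40, 39, 7, 0),
    110: (1, 21, 81, 13, 3),
    100: (0, 19, 89, 28, 4),
    90: (0, 4, 81, 29, 9),
    80: (0, 1, 42, 46, 8),
    70: (0, 0, 16, 29, 11),
    60: (0, 0, 5, 20, 9),    # F inferred from the interval total of 34
    50: (0, 0, 0, 6, 11),
    40: (0, 0, 0, 1, 5),     # F inferred from the interval total of 6
    30: (0, 0, 0, 0, 3),
    20: (0, 0, 0, 0, 0),
    10: (0, 0, 0, 0, 0),
    0: (0, 0, 0, 0, 1),
}

def upper_proportions(freqs):
    """Proportions of an interval's cases marked A, B or above, C or above,
    and D or above. Returns None for an empty interval."""
    total = sum(freqs)
    if total == 0:
        return None
    running, out = 0, []
    for f in freqs[:-1]:          # cumulate A, then A+B, A+B+C, A+B+C+D
        running += f
        out.append(running / total)
    return out

proportions = {lower: upper_proportions(f) for lower, f in marks_by_interval.items()}

for lower in (160, 150, 130):
    print(lower, [round(p, 3) for p in proportions[lower]])
```

For the interval 160-169 the first proportion is 5/13, or .38, and for 150-159 it is 7/23, or .30, as in the text.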

Figure 14.3 shows graphically the relation between these proportions 
and the various score levels. The midpoint of each interval is used to 
represent that interval. This figure demonstrates that the increase in 
probability of being in an upper of two categories on another variable 
(marks) is of an S-shaped form with different degrees of skewness. The 
skewness is related to the over-all proportion in the upper category and 
to the skewness of the total distribution. With large numbers in the 
upper category the skewness tends to be positive and with small numbers 
the skewness tends to be negative. The points are sufficiently in line 
that one can draw continuous curves through them by inspection (which 
has been done in Fig. 14.3), except at the tails of some of them where 
data are incomplete. 

TABLE 14.7.—FREQUENCY DISTRIBUTIONS OF ENGLISH EXAMINATION SCORES FOR 
STUDENTS RECEIVING MARKS ABOVE CERTAIN DIVISION POINTS; ALSO PROPORTIONS 
IN EACH UPPER CATEGORY AT DIFFERENT SCORE LEVELS 

              (1)     (2)     (3)      (4)     (5)      (6)      (7)      (8)      (9) 
Scores        f_t     f_a     p_a     f_ab    p_ab    f_abc    p_abc   f_abcd   p_abcd 
180-189         1       1   (1.00)      1   (1.00)       1   (1.00)       1   (1.00) 
170-179         2       1    (.50)      2   (1.00)       2   (1.00)       2   (1.00) 
160-169        13       5     .38      12     .92       13    1.00       13    1.00 
150-159        23       7     .30      20     .87       23    1.00       23    1.00 
140-149        39       2     .05      28     .72       38     .97       39    1.00 
130-139        66       2     .03      36     .545      60     .91       65     .985 
120-129        86       0     .00      40     .465      79     .92       86    1.00 
110-119       119       1     .01      22     .185     103     .87      116     .975 
100-109       140       0     .00      19     .14      108     .77      136     .97 
 90- 99       123       0     .00       4     .03       85     .69      114     .93 
 80- 89        97                       1     .01       43     .44       89     .92 
 70- 79        56                       0     .00       16     .29       45     .80 
 60- 69        34                       0     .00        5     .15       25     .735 
 50- 59        17                                        0     .00        6     .35 
 40- 49         6                                        0     .00        1    (.17) 
 30- 39         3                                                         0     .00 
 20- 29         0                                                         0     .00 
 10- 19         0 
  0-  9         1 

f_t = frequency in distribution of all students combined. 
f_a = frequency in distribution of students receiving a mark of A. 
p_a = proportion of students in each score interval who received a mark of A. 
Proportions in parentheses are very uncertain owing to the extremely small 
samples from which they are computed. 
f_ab = frequency in distribution of students receiving marks of A and B. 

While we are interested primarily in the score level at which the proba- 
bility of an individual’s being in the upper category is exactly .5, it is 
important to note that these functions tell us much more than that. 
They tell the probability at each score level of an individual’s being in 
the upper category. We can say that for a score of 120 there is apparently 
no chance of a student’s receiving an A, there are about 31 chances in 100 
of his receiving a B or above (with no chance of an A, this amounts to 
the odds for receiving a B), and there are about 89 chances in 100 of his 
receiving a C or better. There is possibly one chance in 100 of his failing 
the course. A student with a score of 70, however, has apparently no 
chance of receiving an A, or B, about 22 chances in 100 of receiving a C 
or better, about 77 chances in 100 of receiving a D or better, and con- 
versely, 23 chances in 100 of failing. 

To determine the scores corresponding to proportions of .5, by this 
graphic solution the division points appear to be: between A and B, a 
score point between 171 and 172; between B and C, a score point between 
130 and 131; between C and D, a score point between 86 and 87; and 
between D and F, a score point between 57 and 58. The last two coincide 
with those read from Fig. 14.2. The first is more accurately determined, 
though still rather uncertain. The estimate of a division of 130.5 between 
marks B and C differs considerably from the 124.5 that was read from 
Fig. 14.2. These comparisons alone tell us nothing about the accuracy 
of either method, except that they agree very closely (within 1 unit) on 
two and roughly on a third, with intolerable disagreement on the fourth. 

[Figure: horizontal axis, “Score in an English examination,” from 0 to 200; the most probable category runs F, D, C, B, A with increasing score.] 
FIG. 14.3.—Proportion of the students who are in each higher letter-grade category at 
each score level in a common freshman-English examination. 
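
A rough numerical stand-in for reading the .5 point from a curve is to interpolate linearly between the two interval midpoints whose proportions straddle .5. This is a substitute for the freehand curves, not the method of the text, so its answers need not match the graphic readings exactly. The proportions below are those of Table 14.7; each midpoint is taken as the lower limit of the interval plus 4.5, and the A versus B division is left out because the A proportions rest on very few cases.

```python
def half_point(points, level=0.5):
    """points: (score midpoint, proportion) pairs in ascending score order.

    Returns the score at which a straight line joining the first pair of
    neighbouring points that straddle `level` crosses it, or None.
    """
    for (x0, p0), (x1, p1) in zip(points, points[1:]):
        if p0 < level <= p1:
            return x0 + (x1 - x0) * (level - p0) / (p1 - p0)
    return None

# Proportions in the upper category near each division point.
curves = {
    "B or above": [(94.5, .03), (104.5, .14), (114.5, .185),
                   (124.5, .465), (134.5, .545), (144.5, .72)],
    "C or above": [(64.5, .15), (74.5, .29), (84.5, .44),
                   (94.5, .69), (104.5, .77)],
    "D or above": [(44.5, .17), (54.5, .35), (64.5, .735), (74.5, .80)],
}

divisions = {name: half_point(points) for name, points in curves.items()}
for name, score in divisions.items():
    print(f"{name}: proportion reaches .5 at a score of about {score:.1f}")
```

Straight-line interpolation gives about 128.9, 86.9, and 58.4 for the three divisions, against 130 to 131, 86 to 87, and 57 to 58 read from the smoothed curves; the largest gap is for the division between B and C.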

Before leaving the two graphic methods, it should be pointed out that a 
very important difference exists between them. In the first of the two, 
only two adjacent distributions are considered in determining the critical 
score that is to separate them. In the second, we consider all cases within 
the one letter-category distribution and all others above as being in the 
upper group and we consider all cases within the neighboring letter- 
category distribution and all others below as being in the lower group. 
This kind of problem comes up only when there are several division points 
to be established; more often there are only two. In the latter instance, 
all of the distribution in X is involved, just as it is in the second graphic 
method and as it is in the computational method to follow. Not only 
does the second graphic method provide more stable values to work with 
because of larger subsamples but it also follows better statistical principles 
as expressed in the development of the computational method. 

A Computation of the Critical Score.—It has been demonstrated recently 
that for this type of problem, predicting membership in one of the two
categories of an artificial dichotomy, a formula may be used to estimate
the critical score.* We must assume for this purpose that both the
distributions (in X and in Y) are actually continuous and normal. The formula is

$$X_c = M_x + \left(\frac{zy}{pq}\right)\left(\frac{\sigma_x^2}{M_p - M_q}\right) \qquad (14.1)$$

(Critical score for division into two categories in a correlated variable)

where $M_x$ = mean of the entire distribution, for those in the two categories
combined.

$p$ = proportion of the total population in the category having the higher
mean score on X.

$q = 1 - p$.

$y$ = ordinate in the unit normal distribution at the point of division of
the area under the normal curve with p proportion above it.

$z$ = standard measure of the point at which the division just referred to
occurs.

This normal distribution stands for the dichotomized variable in the same 
manner as it does in connection with the computation of a biserial r. In 
fact, there is a close relationship between formula (14.1) and the formula 
for computing a biserial r (formula 13.7). There is an alternative formula
for estimating the critical score: 


$$X_c = M_x + \left(\frac{zy}{p}\right)\left(\frac{\sigma_x^2}{M_p - M_x}\right) \qquad (14.2)$$

(Alternative estimation of a critical division point)

The latter version of the formula is applied to the computation of $X_c$
in the English-examination problem, with the work shown in Table 14.8. 

* This method was developed by the author and W. B. Michael, and its derivation is
described elsewhere: Guilford, J. P., and Michael, W. B. The prediction of categories
from measurements. Beverly Hills, Calif.: Sheridan Supply Co., 1949.
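
The two estimates can be sketched in a few lines. This is only an illustration, assuming formula (14.1) has the form X_c = M_x + (zy/pq)(σ²_x/(M_p − M_q)) and formula (14.2) the form X_c = M_x + (zy/p)(σ²_x/(M_p − M_x)). The function names and every numerical input below are hypothetical; they are not the values of Table 14.8. The normal-curve quantities z and y come from Python's standard library.

```python
from statistics import NormalDist

def normal_cut(p):
    """Return (z, y): the unit-normal point with proportion p above it, and its ordinate."""
    unit = NormalDist()          # mean 0, standard deviation 1
    z = unit.inv_cdf(1.0 - p)    # area p lies above z
    y = unit.pdf(z)              # height of the curve at z
    return z, y

def critical_score_14_1(m_x, var_x, m_p, m_q, p):
    """Assumed form of formula (14.1)."""
    q = 1.0 - p
    z, y = normal_cut(p)
    return m_x + (z * y / (p * q)) * (var_x / (m_p - m_q))

def critical_score_14_2(m_x, var_x, m_p, p):
    """Assumed form of formula (14.2); needs only the upper-category mean."""
    z, y = normal_cut(p)
    return m_x + (z * y / p) * (var_x / (m_p - m_x))

# Hypothetical inputs (not from the book).
p = 0.25                              # proportion in the upper category
m_p, m_q = 120.0, 96.0                # category means on X
m_x = p * m_p + (1.0 - p) * m_q       # mean of the combined distribution
var_x = 400.0                         # variance of the combined distribution

x_c_1 = critical_score_14_1(m_x, var_x, m_p, m_q, p)
x_c_2 = critical_score_14_2(m_x, var_x, m_p, p)
print(round(x_c_1, 2), round(x_c_2, 2))
```

The two versions agree whenever the combined mean is the weighted mean of the two category means, since then M_p − M_x = q(M_p − M_q).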




The four division points by calculation are 167.82, 130.2, 86.5, and 53.1.
The second and third are within one unit of those found by the second
graphic method. These findings, though very limited, suggest that the
second graphic method may be superior to the first and that neither is very
satisfactory unless there are a sufficient number of points on both sides
of the .5 level to establish the proper location of the curve in the
region of that important level. The labor involved in computation of $X_c$
by formula is probably no greater than that for the graphic methods and
leaves nothing to guesswork. The graphic method does have one advantage,
that it does not require any assumption about the distributions on the two
variables.

Accuracy of Predicting Artificial Categories.—The evaluations of
predictions of categories when they are made from measurements can be made
in a manner similar to those previously described. Our interest may be in
the numbers and percentages of correct predictions (or in the numbers and
kinds of errors) and in the gain in accuracy of prediction from the new
knowledge possessed.

[TABLE 14.8.—WORKTABLE FOR THE COMPUTATION OF THE DIVISION POINTS ON THE SCORE
SCALE OF THE ENGLISH EXAMINATION AS COMPUTED BY FORMULA. Table body not
recoverable.]

As an illustration, let us take the example of the English-examination
data as related to course marks. To note the accuracy of prediction in
two categories only, we may use the division between the B students and
above and the C students or below. The indications are that the best
separation on the score scale should be between a score of 130 and one of
131. It is not possible to make an
exact separation of the cases given in grouped form in Table 14.6, since the 
dividing score point comes within an interval. For the sake of applying 
the test of goodness of prediction, however, let us assume that the 66 
students are evenly distributed over the range 130-139, and that one- 
tenth of them would have a score of 130. This means about 7 students, 
4 of whom are in the A-B mark group and 3 of whom are in the C-D-F 
group. With these arbitrary, but minor, adjustments, we can arrange 
the entire sample of 826 students in a 2 X 2 distribution as in Table 14.9. 


TABLE 14.9.—SUMMARY OF CORRECT AND INCORRECT PREDICTIONS OF LETTER MARKS
A AND B VERSUS C, D, AND F, IN FRESHMAN ENGLISH FROM AN EXAMINATION
SCORE

                          Examination score
Mark             Above 130    130 or below    All scores
A or B               95            42            137
C, D, or F           90           599            689
All marks           185           641            826

Score group      Prediction     Number     Per cent    Per cent in
                                correct    correct     total group
Above 130        A or B            95        51.4         16.6
130 or below     C, D, or F       599        93.4         83.4
Total                             694        84.0


There are several ways of interpreting this table. We can note that 
there were 132 errors of prediction. If we are interested in predicting 
marks from scores, with the division point adopted we would wrongly 
elect 90 to receive marks of A or B and we would wrongly designate 42 
to receive marks of C, D, or F. In predicting the 185 who according to 
high scores should receive A or B we would be correct in 51.4 per cent 
of the cases. This does not seem very high accuracy, unless we compare 
it with the proportion of those with A and B marks in the entire group, 
which is 137/826, or about 16.6 per cent. In predicting the 641 to receive 
C or below, the accuracy of 93.4 seems very high until we realize that 
about 84 per cent of the entire sample received similar marks. In
comparing the percentages of correct predictions with the percentages of
corresponding types of cases in the entire sample, we are going in the direction
of the chi-square test, in which divergency of distribution in the row or
columns from the distribution in the marginal frequencies is the indication 
of departure from a random situation. A more interpretable index of the 
degree of divergence is the phi coefficient. In this problem, chi square is 
208.11, which is far above required significance levels. From this we find 
φ to be .50, which indicates the amount of correlation between marks and
examination scores when both are dichotomized and used in that manner 
for prediction purposes. 
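
The figures quoted from Table 14.9 can be checked with a short computation. The sketch below uses the usual fourfold-table expression for chi square, N(ad − bc)² divided by the product of the four marginal totals, and takes phi as the square root of chi square over N; the variable names are illustrative. Computed this way, chi square comes out close to, though not identical with, the 208.11 quoted above, and phi rounds to .50.

```python
import math

# Cell frequencies from Table 14.9 (rows: actual mark; columns: score group).
a, b = 95, 42     # A or B:      above 130, 130 or below
c, d = 90, 599    # C, D, or F:  above 130, 130 or below
n = a + b + c + d

correct = a + d                             # predictions that agree with the mark
errors = b + c                              # predictions that disagree
pct_correct_high = 100 * a / (a + c)        # of those predicted A or B
pct_correct_low = 100 * d / (b + d)         # of those predicted C, D, or F
pct_high_in_sample = 100 * (a + b) / n      # base rate of A or B marks

# Chi square for a 2 x 2 table, and the phi coefficient derived from it.
chi_square = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
phi = math.sqrt(chi_square / n)

print(errors, correct, round(pct_correct_high, 1), round(pct_correct_low, 1))
print(round(chi_square, 2), round(phi, 2))
```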

We could test the accuracy of prediction in similar ways for each of 
the other division points. The fourfold tables of frequencies would tell 
their own stories and would summarize the agreement between prediction
and fact. The φ might vary somewhat from one division to
another. In a multiple-category problem like this one, some might prefer
to consider all five mark categories together and note, for each division
point, how many errors in predicting marks are one-place errors, how
many are two-place errors, and so on. A two-place error, for example,
would be predicting a B when a D was obtained. A 5 × 5 contingency
table might be set up with the four critical scores as the division points 
between categories in variable X. In so far as the widths of categories 
on the score scale differ, a contingency coefficient, C, would be the sum- 
marizing index of correlation to use. 

The kind of study of errors of prediction will depend upon what informa- 
tion the investigator hopes to gain from the results. Whenever a proce- 
dure depending upon the counting of cases is used, it should be emphasized 
that rather large samples are needed for dependable comparisons. 

Locating a Critical Point in Predicting a Genuine Dichotomy.—When 
the dichotomy is genuine, the graphic methods that were previously 
described apply. The division is at the point of equal likelihood, and 
the graphic methods satisfy that principle for the sample. Assuming 
that the sample is representative of the population, approximately the 
same division point should be effective in making predictions in the 
population. 

An example of data that may be treated as a genuine dichotomy is
given in Table 14.10.¹ The two categories are “alcoholics” and
“nonalcoholics” defined in the clinical sense. The alcoholics were recognized
by responsible agencies as problem drinkers. It can be argued that there
is a continuum of degrees of tendency toward alcoholism, but clinically
and administratively there is a rather definite categorization which divides
the two. When in doubt about continuity it is best to treat a dichotomy
as being real.

TABLE 14.10.—DISTRIBUTION OF ALCOHOLICS AND NONALCOHOLICS FOR SCORES ON AN
ADJUSTMENT INVENTORY

            Frequency distributions                 Percentage distributions
Scores in   Nonalco-   Alco-    Both   Proportion   Nonalco-   Alco-    Both    Proportion
inventory   holics     holics          alcoholic    holics     holics           alcoholic
            (1)        (2)      (3)    (4)          (5)        (6)      (7)     (8)
66-71       0          1        1      (1.00)       0          0.5      0.5     (1.00)
60-65       0          6        6      (1.00)       0          3.0      3.0     (1.00)
54-59       1          13       14     .93          0.7        6.4      7.1     .90
48-53       1          13       14     .93          0.7        6.4      7.1     .90
42-47       3          17       20     .85          2.2        8.4      10.6    .79
36-41       3          33       36     .92          2.2        16.3     18.5    .88
30-35       2          32       34     .94          1.4        15.8     17.2    .92
24-29       9          32       41     .78          6.6        15.8     22.4    .705
18-23       16         23       39     .59          11.7       11.4     23.1    .49
12-17       36         24       60     .40          26.3       11.9     38.2    .31
6-11        43         7        50     .14          31.4       3.5      34.9    .10
0-5         23         1        24     .04          16.8       0.5      17.3    .03
N           137        202      339    .596         100.0      99.9     199.9
M           14.11      32.83    25.27               14.08      32.80    23.44
σ           10.41      13.93    15.61                                   15.45

¹ These data were adapted from a doctoral dissertation by M. P. Manson, A
psychoneurotic differentiation between alcoholics and non-alcoholics. Quart. J.
Stud. Alcohol., 1948, 9, 175-206.

Inspection of the distributions in the table shows that the possibilities 
for prediction are quite promising. The first graphic method, based upon 
overlapping of the two frequency-distribution curves, with or without 
smoothing, gives a division point between scores 18 and 19. For any 
score of 19 and above we would expect to find more than half of the indi- 
viduals in this sample alcoholic and for a score of 18 and below less than 
half alcoholic. The second graphic method gives the same result as 
the first. 

Before accepting this solution as the one we want, however, it is neces- 
sary to consider a new aspect to the prediction problem when we are 
dealing with qualitative categories. Second thought about the alcoholism 
data will suggest the idea that the distributions as given represent the 
general population of men very poorly. In the general population, the 
proportion of alcoholics is extremely small; certainly not 60 per cent, as 
the data in question show. The data were obviously not selected on the
basis of stratification. In fact, for the purpose of the investigation,
contrasting groups of about equal size were desired. Suppose that we had
alcoholics represented in line with their proportion in the general popu- 
lation. When we came to apply the first graphic method, with relatively 
much smaller frequencies in that group, the intersection of the curve with 
that for the nonalcoholic group would have been at a much higher score, 
if indeed it intersected at all. By the second graphic method, the
proportions of alcoholics might have been less than .5 at all score levels. No
solution by the principle of equal likelihood would then have been possible.
Another type of solution is therefore called for; one less dependent upon
the proportions of the two kinds of individuals in the general population,
if the principle of equal likelihood is to be applied.

FIG. 14.4.—Proportion of alcoholics at each score level on an adjustment inventory.
The problem is to find that score point above which more than half have the property
alcoholic. (Horizontal axis: score on an adjustment inventory, 0 to 65; vertical
axis scaled from 0 to 1.0.)

Assuming that we have qualitative categories, and that we are attempting
to predict one quality or another, it would seem logical to treat the
two as being of equal importance. In the data of Table 14.10 we may
regard the mean of 14.11 as being characteristic of nonalcoholics as a
species, also the form of distribution they gave. This is true if there was
no biasing of sampling within this group as such. Likewise, we may
regard the distribution of scores for alcoholics as characteristic of their
population. This suggests a solution which would allow the two species
equal representation. To achieve equal representation we may convert
the obtained frequencies into percentage frequencies. These appear in
columns (5) and (6) of Table 14.10. Beside them, in column (7), are
given the sums of percentage frequencies in the different class intervals,
and in column (8) are given the proportion of alcoholics at each score
level. The graphic solution based upon these is shown in Fig. 14.4, which
yields a critical division point between scores 20 and 21. Following this
approach we may say that with scores 21 and above the odds are greater
than .5 that the individuals have the property of alcoholism and with
scores of 20 and below the odds are less than .5 for this property. We
will consider later how many and what kind of errors this division point
would entail.
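
The equal-representation procedure can be imitated numerically. The sketch below rebuilds columns (5) to (8) of Table 14.10 from the raw frequencies and then locates the point where the proportion of alcoholics passes .5. Straight-line interpolation between class-interval midpoints is an assumption made here in place of the freehand curve of Fig. 14.4, and the function names are illustrative. On these data the interpolated point falls between scores 20 and 21, in line with the graphic reading.

```python
# Class-interval midpoints and frequencies from Table 14.10, lowest interval first.
midpoints = [2.5, 8.5, 14.5, 20.5, 26.5, 32.5, 38.5, 44.5, 50.5, 56.5, 62.5, 68.5]
nonalcoholics = [23, 43, 36, 16, 9, 2, 3, 3, 1, 1, 0, 0]
alcoholics = [1, 7, 24, 23, 32, 32, 33, 17, 13, 13, 6, 1]

def equalized_proportions(lower_group, upper_group):
    """Proportion in the upper group per interval after converting each group to percentages."""
    n_low, n_up = sum(lower_group), sum(upper_group)
    proportions = []
    for f_low, f_up in zip(lower_group, upper_group):
        pct_low = 100 * f_low / n_low      # column (5)
        pct_up = 100 * f_up / n_up         # column (6)
        proportions.append(pct_up / (pct_low + pct_up))   # column (8)
    return proportions

def first_crossing(xs, ps, level=0.5):
    """Linearly interpolate the first x at which ps rises through `level`."""
    for i in range(len(xs) - 1):
        if ps[i] < level <= ps[i + 1]:
            fraction = (level - ps[i]) / (ps[i + 1] - ps[i])
            return xs[i] + fraction * (xs[i + 1] - xs[i])
    return None   # the curve never reaches the level

proportions = equalized_proportions(nonalcoholics, alcoholics)
critical = first_crossing(midpoints, proportions)
print([round(v, 2) for v in proportions])
print(round(critical, 1))
```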

When the two category groups are equated for size, as in the method 
just described, a much simpler solution is possible in certain situations. 
If the two distributions on the continuous variable are both symmetrical 
and of the same dispersion, the critical point will be at the unweighted 
mean of the two category means ($M_p$ and $M_q$). This would be true, also,
if with equal dispersions any positive skewness in the one distribution is
compensated for by a like degree of negative skewness in the other. If all
one wants is a division score and if these conditions are satisfied, the mean
of the two means equally weighted will serve. For the data on alcoholism,
the mean of the two means is 23.44. This is somewhat higher than the
critical point determined by the graphic method, because the two dis- 
tributions differ markedly in dispersion and in skewness. 

Computation of a Critical Value Dividing Genuine Dichotomies.—Without
assuming any particular form of distribution for the continuous variable
except that it be continuous, a critical value that will approximately
satisfy the principle of equal likelihood may be estimated by the formula¹

$$X_c = M_x + \left(\frac{1}{pq}\right)(.5 - p)\left(\frac{\sigma_x^2}{M_p - M_q}\right) \qquad (14.3)$$

(Critical value on X dividing cases into most probable categories)

where $M_x$ = mean of all X values.
$p$ = proportion of the cases in the category having the higher mean of X values.
$q = 1 - p$.
$M_p$ = mean of X values for category higher on X.
$M_q$ = mean of X values for category lower on X.
$\sigma_x^2$ = variance in the total distribution on X.

Let us apply this formula to the prediction of sex membership of high-
school students from knowledge of hand-grip scores. For a sample of
171 boys and 246 girls, the two means ($M_p$ and $M_q$) were 37.35 and 20.68,
respectively. The mean of all cases combined was 27.51. The variance
of the combined group was 115.38. The proportions (p and q) were .410
and .590. Applying formula (14.3),

$$\begin{aligned}
X_c &= 27.51 + \left[\frac{1}{(.410)(.590)}\right][.5 - .410]\left[\frac{115.38}{37.35 - 20.68}\right] \\
    &= 27.51 + \left(\frac{.09}{.2419}\right)\left(\frac{115.38}{16.67}\right) \\
    &= 27.51 + (.37205)(6.9214) \\
    &= 27.51 + 2.58 \\
    &= 30.09
\end{aligned}$$

This result tells us that students earning a score of 31 or above are more
likely to be boys than girls; those with scores of 30 or below are more
likely to be girls.

¹ From Guilford and Michael, op. cit.

An alternative formula requires less information. It reads

$$X_c = M_x + \left(\frac{.5 - p}{p}\right)\left(\frac{\sigma_x^2}{M_p - M_x}\right) \qquad (14.4)$$

(Alternate to 14.3)

where the symbols are as defined previously. While this formula is more
convenient in computing, formula (14.3) is somewhat more meaningful.
It will pay to examine (14.3) to see what may be expected as p varies
and as $M_p - M_q$ varies.
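
As a check on the hand-grip example, both versions of the formula can be evaluated directly. The sketch assumes formula (14.3) is X_c = M_x + (1/pq)(.5 − p)(σ²_x/(M_p − M_q)) and formula (14.4) is X_c = M_x + ((.5 − p)/p)(σ²_x/(M_p − M_x)), which is the reading consistent with the arithmetic shown in the text; the function names are illustrative.

```python
def critical_value_14_3(m_x, var_x, m_p, m_q, p):
    """Critical value from the two category means (assumed form of formula 14.3)."""
    q = 1.0 - p
    return m_x + (1.0 / (p * q)) * (0.5 - p) * (var_x / (m_p - m_q))

def critical_value_14_4(m_x, var_x, m_p, p):
    """Critical value from the upper-category mean only (assumed form of formula 14.4)."""
    return m_x + ((0.5 - p) / p) * (var_x / (m_p - m_x))

# Hand-grip data quoted in the text: boys are the category higher on X.
m_x, var_x = 27.51, 115.38
m_p, m_q = 37.35, 20.68
p = 0.410

x_c_3 = critical_value_14_3(m_x, var_x, m_p, m_q, p)
x_c_4 = critical_value_14_4(m_x, var_x, m_p, p)
print(round(x_c_3, 2), round(x_c_4, 2))
```

The two results differ only in the second decimal place, because the rounded values of p and M_x quoted in the text do not satisfy M_p − M_x = q(M_p − M_q) exactly.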

First, note that the critical score is the mean of all the X values plus
an increment. This increment is positive and $X_c$ will be above the general
mean when p is less than .5. It will be negative and $X_c$ will be below the
general mean when p is greater than .5. The division of cases in making
predictions is in the same direction as that in the population. When
p = .5, the increment becomes zero and the critical value equals $M_x$.
This fact is true regardless of the amount of correlation existing between
X and the categories. When p deviates very far from .5, the ratio becomes
quite large and likewise the increment. The critical value may even go
outside the distribution, which would mean that we would predict all
cases to be within the category having the greater frequency. If 90 per
cent of a population, let us say, are in the upper category, $X_c$ might go
very low on the scale. If we predicted all, or nearly all, of the cases to
be in the upper category, we would, of course, make a very small number
of errors.
It is of interest to consider the relation of the increment to the amount
of correlation between X and Y. The type of correlation appropriate
here is the point biserial. The point-biserial r is proportional to $M_p - M_q$
and inversely proportional to $\sigma_x$. This being true, it appears that the
increment is inversely proportional to the amount of correlation. The
higher the correlation, the nearer $X_c$ is to the general mean, $M_x$. When
the correlation is perfect, predictions should ordinarily be perfect. For
predictions to be perfect, the position of $X_c$ should be such that the
proportion expected in the upper category coincides with p, the obtained
proportion. As the correlation approaches zero, the critical value departs
more and more from $M_x$ and assures the prediction of more and more
cases in the more populous category. As r becomes zero, if p does not
equal .5, the increment becomes very large and most predictions fall in
the more populous group, if not all. Thus, the prediction is determined
relatively more by knowledge of X when the correlation is large and by
the knowledge of which category is more populous relatively more when
the correlation is small, as we should expect.
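
The inverse relation can be written out explicitly. The lines below are a sketch, assuming the usual expression for the point-biserial coefficient (with σ_x computed from all N cases) and the form of formula (14.3) in which the increment is (1/pq)(.5 − p)(σ²_x/(M_p − M_q)); the symbol r_pb is introduced here only for the illustration.

```latex
r_{pb} = \frac{M_p - M_q}{\sigma_x}\sqrt{pq}
\quad\Longrightarrow\quad
\frac{\sigma_x^2}{M_p - M_q} = \frac{\sigma_x\sqrt{pq}}{r_{pb}}

X_c - M_x
  = \frac{.5 - p}{pq}\cdot\frac{\sigma_x\sqrt{pq}}{r_{pb}}
  = \frac{.5 - p}{\sqrt{pq}}\cdot\frac{\sigma_x}{r_{pb}}
```

For fixed p and σ_x the increment therefore varies as 1/r_pb: halving the correlation doubles the distance of the critical value from the general mean.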

When Population Proportions Differ from Sample Proportions.—Formulas
(14.3) and (14.4) presuppose that the sample proportion is a good estimate 
of the population proportion. Application of the principle of equal likeli- 
hood depends upon this. In the case of the prediction of alcoholism from 
inventory scores, however, we know the population proportion of alcoholics 
is very far from the .596 that prevailed in the sample. In the general 
nonhospitalized population, the proportion might be less than 1 per cent. 
In a prison population or a hospital population, it would undoubtedly be 
greater than 1 per cent. In a psychopathic ward it would probably be 
even greater. How, then, should we apply the formulas? Will we want 
to observe the principle of equal likelihood under all situations? We saw 
some doubt cast on its application earlier. Let us apply formula (14.4) 
to the data on alcoholism, assuming different population proportions for
alcoholic addiction; proportions of .333 (one-third), .2, .1, and .01, as well
as the .596 of the Manson study and the special case of p = .5. We do
not have data derived from such populations, but if we assume that the
means and standard deviations already found for the two categories of
persons hold for the general situation, we can estimate $M_x$ and $\sigma_x^2$ for
populations made up of the specified proportions. The data are given in
Table 14.11.

TABLE 14.11.—ESTIMATION OF CRITICAL DIVISION SCORES FOR PREDICTING ALCOHOLISM
AS POPULATION PROPORTIONS OF ALCOHOLICS ARE ALLOWED TO VARY

  p      M_x     σ²_x    M_p − M_x   σ²_x/(M_p − M_x)   (.5 − p)/p   Increment     X_c
 .596   25.28   243.77      7.55          32.287          − 0.161     −   5.20    20.08
 .500   23.47   238.78      9.36          25.511            0.000         0.00    23.47
 .333   20.35   214.78     12.48          17.210          + 0.500     +   8.60    28.95
 .200   17.85   181.56     14.98          12.120          + 1.500     +  18.18    36.03
 .100   15.98   148.47     16.85           8.811          + 4.000     +  35.25    51.23
 .010   14.30   112.69     18.53           6.082          +49.000     + 297.99   312.29

For the obtained proportion of .596 for alcoholics, the $X_c$ which would
give the maximal number of correct classifications is 20.08. For an
assumed proportion of .50, $X_c$ is 23.47, which is equal to $M_x$ when the
two classes are equal in size. This differs from the value estimated by
the graphic method in Fig. 14.4, which was approximately 20.3. The
two may be expected to coincide, as was suggested previously, when the
two distributions have equal dispersions and skewness. They do not
satisfy this condition here. If alcoholics made up a third of the
population in which predictions are made, the $X_c$ should be at 28.95. If they
made up only 1 per cent of the population, it would take a critical score
of 312 to find the two kinds of individuals equally represented. This is,
of course, well outside the practical range of scores.
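
The entries of Table 14.11 can be approximated from the category means and standard deviations of Table 14.10. The sketch below assumes that the combined mean is the weighted mean of the two category means and that the combined variance is pσ²_p + qσ²_q + pq(M_p − M_q)², the standard result for a mixture of two groups; the text does not state which estimate was actually used, so small differences from the printed values are to be expected. The critical score follows the assumed form of formula (14.4).

```python
# Category statistics from Table 14.10 (alcoholics are the category higher on X).
M_P, SD_P = 32.83, 13.93    # alcoholics
M_Q, SD_Q = 14.11, 10.41    # nonalcoholics

def critical_score_for_proportion(p, m_p=M_P, sd_p=SD_P, m_q=M_Q, sd_q=SD_Q):
    """Return (M_x, variance, X_c) for an assumed population proportion p of alcoholics."""
    q = 1.0 - p
    m_x = p * m_p + q * m_q                                    # weighted mean
    var_x = p * sd_p ** 2 + q * sd_q ** 2 + p * q * (m_p - m_q) ** 2
    increment = ((0.5 - p) / p) * (var_x / (m_p - m_x))        # assumed formula (14.4)
    return m_x, var_x, m_x + increment

for p in (0.596, 0.5, 0.333, 0.2, 0.1, 0.01):
    m_x, var_x, x_c = critical_score_for_proportion(p)
    print(f"{p:5.3f}  {m_x:6.2f}  {var_x:7.2f}  {x_c:7.2f}")
```

The rapid growth of the critical score as p shrinks comes from the factor (.5 − p)/p, which is 0.5 at p = .333 but 49 at p = .01.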

It is true that, for the same critical score (23, for example), the numbers
and percentages of mistakes (of the kind diagnosing nonalcoholics as
alcoholics) become greater as the proportion of nonalcoholics increases.
To reduce the number of mistakes one would move $X_c$ upward,
as the results in Table 14.11 demonstrate. For practical use of the
predictive instrument, however, one would have to desert the principle of
equal likelihood. Decisions then should be made taking into consider- 
ation the relative seriousness of the two kinds of errors. The principle 
of equal likelihood carries the implicit assumption that the two kinds of 
error are of equal importance. 

Effectiveness of Predictions in Genuine Dichotomies.—The goodness of
prediction of the type being discussed here can be evaluated in much the 
same manner as for the prediction of artificial categories. This is true, 
particularly, when there are stable and meaningful population proportions 
in the two categories. In view of the several qualifications mentioned 
above, however, the kind of evaluation will have to be adapted to fit the 
situation and to give the most meaningful and pertinent conclusion. The 
point-biserial r is a general index of correlation that applies here. It will
not give the kind of answer often desired in this connection. With a 
given critical value chosen for X, we have a fourfold contingency table, 
to which other tests, as described before, apply. 

Exercises 


1. Using Data 14A, make predictions in both directions. Determine the percentages
of correct predictions with and without knowledge of categories and the percentage of 
forecasting efficiency. Discuss the results, including the usefulness of the predictions. 




2. Using Data 14B, make predictions of whether a student will report “Yes,” “?” or 
“No” to the question about talking when he makes similar responses to the question 
about walking in his sleep, and vice versa. What are the percentages of accuracy in 
these various predictions and in the over-all set of predictions? 

3. Apply the cell square contingency test to Data 14B, testing predictions from 
different sources. Make any combinations of categories that seem necessary. Com- 
pute chi square for the entire table. Draw conclusions. 

4. Find a critical total score which will subdivide the total group in Table 13.4 into 
the most probable categories (passing and failing). Use two graphic methods and a 
solution by formula. Discuss any discrepancies that may occur. 

5. Find a critical division point between boys and girls for the data in Fig. 15.1, 
which will make the best prediction of sex membership from knowledge of weight. Use 
formulas (14.3) and (14.4). Evaluate the results of prediction in any way that seems 
most informative to you. 


DATA 14A.—RELATIONSHIP BETWEEN FAILING IN COLLEGE AND BEING ABOVE OR
BELOW THE MEDIAN IN HIGH-SCHOOL GRADUATING CLASS

Status in high-school class    Failing in one     No failures in    Total
                               or more courses    first semester
Above the median                     37                340            377
Below the median                     49                 71            120
Total                                86                411            497


DATA 14B.—RELATIONSHIP BETWEEN WALKING IN ONE’S SLEEP AND TALKING IN
ONE’S SLEEP AS REPORTED BY 1,787 STUDENTS*

                               Walk in your sleep?
Talk in your sleep?      Yes       ?        No       Total
Yes                       88        9       400        497
?                          3       14       194        211
No                         7        3     1,069      1,079
Total                     98       26     1,663      1,787


* Jenness, A. F., and Jorgensen, A. P. Ratings of vividness of imagery in the waking state 
compared with reports of somnambulism. Amer. J. Psychol., 1941, 54, 253-259. Reproduced 
with the permission of the editor of Amer. J. Psychol. 


CHAPTER 15 
PREDICTION OF MEASUREMENTS 


PREDICTING MEASUREMENTS FROM ATTRIBUTES 


The Principle of Least Squares.—What would be the most accurate 
prediction of the weight of a sixteen-year-old youth? By “most accurate” 
we mean a weight that, if chosen to predict the weight of each sixteen-year- 
old selected at random from a certain population, would be closer to the 
facts in the long run than any other estimate would be. To state the 
matter in another way, we want a predicted weight that would give us 
the smallest average discrepancy from the actual weights. For every 
person, we should find the difference between his actual weight and our 
prediction in order to obtain the single discrepancy. 

Statisticians have good reason to deal here in terms of the squares of 
the discrepancies rather than in terms of the discrepancies themselves. 
They demand a predicted measurement from which the sum of the squared 
discrepancies is a minimum. The prediction that will satisfy this
requirement has been proven to be the mean of the distribution. In choosing
the mean as our prediction, we are following the principle of least squares. 
Whereas in predicting attributes we chose the mode of a distribution as 
the indicator that would give us the smallest percentage of error of place- 
ment of cases, in predicting measurements, we choose the mean as the 
indicator, which gives us the smallest set of squared deviations from the 
predicted value. 
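
The reason the mean satisfies the least-squares requirement can be shown in two lines. In the sketch below, c stands for any candidate prediction, M for the mean, and Y_i for the individual measurements; this notation is introduced only for the illustration.

```latex
\sum_{i=1}^{N}(Y_i - c)^2
  = \sum_{i=1}^{N}\bigl[(Y_i - M) + (M - c)\bigr]^2
  = \sum_{i=1}^{N}(Y_i - M)^2 + 2(M - c)\sum_{i=1}^{N}(Y_i - M) + N(M - c)^2

\text{Since } \sum_{i=1}^{N}(Y_i - M) = 0:\qquad
\sum_{i=1}^{N}(Y_i - c)^2 = \sum_{i=1}^{N}(Y_i - M)^2 + N(M - c)^2
```

The last term is never negative and vanishes only when c = M, so no other single prediction gives a smaller sum of squared discrepancies.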

Predictions Apply to Selected Populations.—In answering the question 
with which we started this discussion, the best prediction of the weight 
of a sixteen-year-old, any better knowledge being lacking, is the mean 
weight of the population of which he is a member. If we wanted this to
cover all sixteen-year-olds, we should see to it that our distribution from 
which we derive our mean is made up of a large sample in which both 
sexes, all races, and all socioeconomic and geographic groups are pro- 
portionately represented. We might, however, confine the question to 
sixteen-year-olds from the United States. We might further confine it 
to high-school youths in one city, or, even further, to one particular high
school. Whatever our restriction in population, the predicted weight will
apply only (except by chance) to that kind of population. In fact,
strictly speaking, it will apply only to the measured sample. If we
extend our predictions to samples beyond our known population, we
do so at the risk of enlarging errors of prediction.

Errors of Prediction Measured by the Standard Deviation.—In a certain
high school in a certain American city, a random sample of 51
sixteen-year-olds had weights distributed as shown in Fig. 15.1. For the sake of
an illustration, we shall adopt the sixteen-year-olds in this high school as
our population. What we say concerning predictions within this group
will hold by analogy to larger, more inclusive populations. The mean of
the 51 students’ weights is 61.9 kg. and the standard deviation is 13.2.


FIG. 15.1.—Distributions of sixteen-year-old high-school boys and girls for weight in
kilograms. Each dot represents one individual. (Panels: Boys, Girls, Both.)


If now the 51 students were listed in alphabetical order and without see- 
ing them we used merel 


nearly predict the actual 
“61.9 kg.” The odds are about 2 to 1 
that our errors would be 


predicted weight. The o of 13.2 kg. ma: 


quares. We 
prediction in this instance, 


and for practical purposes of making decisions for individuals where their
weights are important factors, we should be seriously in error in many
cases. But we could do less well in predicting the individuals’ weights if
we did not even possess the knowledge of their mean. Even if we knew
the mean of sixteen-year-olds in general and used that as our predictive
value, we should do worse than we did, unless the mean of this small
population coincides with that of all sixteen-year-olds. In other words,
by knowing one attribute of our population (a group in one American
high school) and the mean that goes with that attribute, we reduce the
error of prediction to some extent.

Predicting Weight from Knowledge of Sex.—Of the 51 cases in the 
population of sixteen-year-olds, 24 were boys and 27 were girls. Will it 
help to predict more accurately if we know each individual’s sex? It 
should, since there is a sex difference in weights. Though many girls are 
heavier than many boys, the averages are distinctly apart—67.8 for the 
boys and 56.6 for the girls. Using the attribute of sex to contribute 
toward the prediction of individual cases and following the principle of 
least squares, for each boy who came along we should predict his weight 
to be 67.8 kg., and for each girl, the prediction would be 56.6 kg. 

How much will predictions now be improved? The margin of error of
predictions for boys is given by the σ of their distribution, which is 12.6 kg.,
and the margin of error for the girls is given by a σ of 11.3. From this
information, we see that both boys’ and girls’ weights are more accurately
predicted than before (when the margin of error was 13.2) and that the
girls’ predicted weights are more free from error than are the boys’.

As a matter of consistency with previous procedures, let us ask what
the percentage of reduction in error of prediction is. For the boys, the
change of .6 in the σ is 4.5 per cent, and for the girls, the change in σ is
1.9, or 14.4 per cent.
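
The two percentages follow from expressing each drop in σ as a fraction of the original 13.2. A brief sketch, with illustrative variable names:

```python
# Standard deviations quoted in the text, in kilograms.
sigma_all = 13.2                               # all 51 sixteen-year-olds
sigma_by_sex = {"boys": 12.6, "girls": 11.3}   # within each sex

# Percentage reduction in the margin of error when sex is known.
reductions = {
    group: 100 * (sigma_all - sigma) / sigma_all
    for group, sigma in sigma_by_sex.items()
}
for group, pct in reductions.items():
    print(group, round(pct, 1))
```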

The Standard Error of Estimate.—There is a way of summarizing the
margin of error for all cases combined. This requires the computation
of a standard error of estimate. It is a kind of summary of all the squared
discrepancies of actual measurements from the predicted measurements.
In terms of a formula, the standard error of estimate is

$$\sigma_{yx} = \sqrt{\frac{\Sigma (Y - Y')^2}{N}} \qquad (15.1)$$

(Standard error of estimate)

where Y = measured value of a case we are trying to predict.
Y' = predicted value for the case.
N = total number of cases predicted.

The subscript in $\sigma_{yx}$ tells us that we are predicting variable Y from
variable X. In the illustrative problem, Y is the variable of weight, and X is
the variable of sex difference. The sum of the discrepancies squared (see
Table 15.1) is 7,288.1; so

$$\sigma_{yx}^2 = \frac{7{,}288.1}{51} = 142.90$$

$$\sigma_{yx} = 11.9$$


The standard error of the estimate, in predicting weight on the basis of
knowledge of sex, is 11.9. Using only the knowledge that this is a particular
group of sixteen-year-olds with a mean of 61.9, the error of estimate
was given by a standard deviation of 13.2. The margin of error
using the information supplied by sex difference is 90.2 per cent as large
as that without using this information. The reduction in size of error of
prediction is 9.8 per cent, which is rather small but represents some gain.

In computing the standard error of estimate in this kind of problem,
it is probably more natural to do so by finding the σ’s of the two part
distributions separately and then combining them. They cannot be combined
directly by simple addition or averaging. It is the squared deviations
in the two groups that must be combined. The sum of the squared
deviations in each distribution can be found by the formula¹


$$\Sigma x_a^2 = N_a \sigma_a^2 \qquad \text{(Sum of squares of discrepancies within one distribution)} \tag{15.2}$$

where $\Sigma x_a^2$ = sum of the squared discrepancies between prediction and
fact (or between measurements and the mean) in distribution A
(one of the attribute distributions),
$N_a$ = number of cases in distribution A,
$\sigma_a$ = standard deviation of the same distribution.

When these sums of squared deviations have been found, they may be
combined by addition. In other words,

$$\Sigma (Y - Y')^2 = \Sigma N_k \sigma_k^2 \qquad \text{(Sum of squares of discrepancies in all distributions)} \tag{15.3}$$

where $N_k$ = number of cases in any component distribution (distributions
A, B, C, etc., in turn).
$\sigma_k$ = standard deviation of the same distribution.²

The work of computing Σ(Y − Y′)² for the problem on weights of
sixteen-year-olds may be summarized as in Table 15.1. From here the
computation of $\sigma_{yx}$ is exactly the same as previously demonstrated.

¹ Cf. formula (5.8).
² It will be recognized that Σ(Y − Y′)² is essentially a sum of squares from which
the within variance would be computed in analysis of variance (see Ch. 10).




TABLE 15.1.—SUMMARY OF THE COMBINATIONS OF SUMS OF SQUARES FROM DIFFERENT SUBSAMPLES

Distribution     N_k     σ_k       σ_k²      N_k σ_k²
Boys              24    12.65    160.02     3,840.48
Girls             27    11.30    127.69     3,447.63
Σ(Y − Y′)²                                  7,288.11
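The combination of subgroup sums of squares in Table 15.1 can be sketched in a few lines of Python. This is only an illustration of formulas (15.1) and (15.3); the function names and the `weight_groups` list are assumptions introduced here, and the figures are the N's and σ's from the table.

```python
import math

def pooled_sum_of_squares(groups):
    """Sum of N_k * sigma_k**2 over all groups, as in formula (15.3)."""
    return sum(n * sd ** 2 for n, sd in groups)

def standard_error_of_estimate(groups):
    """Formula (15.1), with the numerator built from group N's and sigmas."""
    total_n = sum(n for n, _ in groups)
    return math.sqrt(pooled_sum_of_squares(groups) / total_n)

# (N_k, sigma_k) for boys and girls, taken from Table 15.1
weight_groups = [(24, 12.65), (27, 11.30)]

ss_within = pooled_sum_of_squares(weight_groups)
se_est = standard_error_of_estimate(weight_groups)

# Percentage reduction relative to the sigma of the whole group (13.2)
reduction_pct = 100 * (1 - se_est / 13.2)

print(round(ss_within, 2), round(se_est, 2), round(reduction_pct, 1))
```

Carried to more decimals, the standard error of estimate is about 11.95, which gives a reduction of roughly 9.4 per cent; the 9.8 per cent in the text follows from using the shortened value 11.9.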


Other Predictive Indices May Be Introduced.—It should be added that 
other attributes may be brought into the predictive picture. For instance, 
if different glandular constitution has a definite bearing on body weight, 
for example, thyroid functioning, we could subdivide each sex group into 
two or three categories as to glandular condition. The mean of each new 
subgroup would then become the prediction for members of that group. 
The deviations of actual weights from these means would be smaller and 
the new standard error of estimate would be reduced in size. 

If we were successful in singling out all the significant factors correlated 
with weight and could predict from all of them at the same time, theoreti- 
cally we could reduce errors of prediction to approximately zero. We can 
probably never know what all the significant factors are from which weight 
can be determined, and if we did it might be impossible to assign all the 
attributes to each individual. We are here speaking of the hypothetical 
limiting case. Any improvement in predictions approaches that limit. 
From a practical standpoint, it is always a question of whether the trouble 
of uncovering and using new descriptive attributes is justified by the gains 
in predictive accuracy that result. 

Estimation of Errors of Prediction in the Population.—The standard
error of estimate computed for the weight-prediction problem, strictly
speaking, applies to the sample only. It is a biased estimate of the
margin of error that would occur in making predictions beyond this
particular sample but in the same population. To estimate the standard
error of estimate for the population, we need, as usual, to consider degrees
of freedom, unless the sample is large. The formula would be the same as
(15.1) with the substitution of N − m for N, where m is the number of
categories predicted from.

$${}_{c}\sigma_{yx} = \sqrt{\frac{\Sigma (Y - Y')^2}{N - m}} \qquad \text{(Standard error of estimate corrected for bias)} \tag{15.4}$$

With this formula applied instead of formula (15.1), the corrected standard
error of estimate is 12.2 rather than 11.9. The corrected one is the
more realistic one to use in making predictions outside the sample.
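A short numerical check of the correction may be useful. The sketch below applies formula (15.4) to the weight problem; the function name is an assumption made for illustration, and the inputs (sum of squares 7,288.11, N = 51, m = 2 categories) come from the text and Table 15.1.

```python
import math

def corrected_standard_error(sum_sq, n_cases, n_categories):
    """Formula (15.4): divide the sum of squares by N - m instead of N."""
    return math.sqrt(sum_sq / (n_cases - n_categories))

sum_sq = 7288.11          # sum of squared discrepancies, Table 15.1
n_cases = 51              # sixteen-year-olds in the sample
n_categories = 2          # boys and girls

uncorrected = math.sqrt(sum_sq / n_cases)
corrected = corrected_standard_error(sum_sq, n_cases, n_categories)

print(round(uncorrected, 2), round(corrected, 2))
```

The difference between the two values shrinks as N grows relative to m, which is why the correction matters mainly in small samples.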


PREDICTING MEASUREMENTS FROM OTHER MEASUREMENTS 


When both known and predicted variables are measured on linear scales 
and there is some relation between them so that predictions are possible, 
we have a much more complicated problem. A complete treatment of it 
involves correlation methods, regression equations, and other procedures. 

The Correlation Diagram.—Our illustration of this kind of problem
consists of two achievement examinations in a course on educational
measurements. In Table 15.2, we have the two distributions grouped in
class intervals and the measurements in each class interval broken down
to form a distribution of its own in the other test. The class intervals for
test X are listed along the top of Table 15.2 and the class intervals for
test Y are listed along the left margin.

Prediction of Y from X.—As usual, we have here a double prediction
problem; the prediction of a score in Y from a known score in X, and
vice versa. Let us consider the prediction of Y from X first. For the
individuals in any class interval in test X, the best prediction is the mean
of the Y distribution in that column, in other words, the mean of the
column ($M_c$). For each column of Table 15.2, its mean is listed in the

TABLE 15.2.—PREDICTING SCORES IN ONE TEST FROM KNOWN SCORES IN ANOTHER TEST

                                      Test X
Test Y     60-64  65-69  70-74  75-79  80-84  85-89  90-94  95-99    f_r   M_row   σ_row
135-139                                                                1    97.0     —*
130-134                                                                3    83.7    6.61
125-129                                                                4    85.8    5.45
120-124                                                               17    83.2    5.67
115-119                                                               22    78.6    5.72
110-114                                                               22    75.9    6.56
105-109                                                               10    74.0    5.56
100-104                                                                6    70.3    6.87
 95- 99                                                                2    67.0    0.00
f_c            3     10     12     26     18     12      5      1     87
M_c        107.0  105.5  114.9  114.5  116.4  120.3  124.0  137.0
σ_c         4.08   5.52   4.31   6.83   6.43   4.71   5.10     —*

[Cell frequencies are illegible in this copy; only the marginal values are shown.]
* The standard deviation of this array is indeterminate.
next to last row. For the first column, $M_c$ is 107.0. Any person receiving
a score from 60 to 64 inclusive in test X will most probably earn a score of
107.0 in test Y. The other means of the columns are similarly interpreted.
It will be noticed that there is a general upward trend in the $M_c$’s as we
go up the scale in test X, though there are two inversions. In view of
the small numbers of cases upon which these means are based, some
inversions are not surprising.

The margin of error in predicting Y from X in each column is indicated by
the standard deviation of that column. The $\sigma_c$’s are listed in the last
row of Table 15.2. They remain fairly constant, but the range is from
4.08 to 6.83. The significance of the variations in $\sigma_c$ could be examined by
making F tests (see Ch. 9).

The entire picture of predictions and their margins of errors within
columns is shown graphically in Fig. 15.2. The circlets show the positions
of the column means, and the vertical lines running through them extend
from $-1\sigma_c$ to $+1\sigma_c$. In each column, we expect two-thirds of the
observed scores to lie within the limits of these lines.

FIG. 15.2.—A chart showing the most probable score in test Y corresponding
to each midpoint score in test X; also the range between minus and plus one
standard deviation within each column.

Standard Error of Estimate.—In order to obtain a single indicator of
the goodness of the prediction of Y scores from X scores, we may compute
a standard error of estimate as we did before when predicting measurements
from attributes. The work is best organized as in Table 15.3.



TABLE 15.3.—COMPUTATIONS OF THE STANDARD ERROR OF ESTIMATE OF Y SCORES FROM X SCORES

N_c      σ_c²      N_c σ_c²
  3     16.67         50.01
 10     30.45        304.50
 12     18.58        222.96
 26     46.63      1,212.38
 18     41.36        744.48
 12     22.22        266.64
  5     26.00        130.00
Σ(Y − Y′)²         2,930.97

For every column, we list first $N_c$, the number of cases in that column.
Next we list $\sigma_c^2$, the squared σ of the distribution in that column. Next
we find the product of these two values for that column. The sum of
these products for all columns yields Σ(Y − Y′)², which we need for
computing $\sigma_{yx}$. This sum is 2,930.97. From here on the work follows
formula (15.1).

$$\sigma_{yx}^2 = \frac{2{,}930.97}{87} = 33.6893$$

$$\sigma_{yx} = 5.80$$


The o of the entire distribution of Y scores is 7.85, so that there is a 
reduction in variability of 2.05, or 26.1 per cent, a marked improvement 
in prediction, as such tests go. We may say that the forecasting efficiency 
for predicting Y scores from X scores 
as we did is approximately 26 per cent. 
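The column-mean procedure can also be carried out directly on paired scores. The following Python sketch is an illustration only: the ten score pairs are hypothetical (the cell frequencies of Table 15.2 are not used), and the function name, `interval_width`, and `lowest` are assumptions introduced here. It groups X into class intervals, predicts each Y by its column mean, and reports the standard error of estimate and the percentage reduction in error.

```python
import math
from collections import defaultdict

def predict_by_column_means(x_scores, y_scores, interval_width, lowest):
    """Predict each Y by the mean of the X class interval (column) it falls in."""
    columns = defaultdict(list)
    for x, y in zip(x_scores, y_scores):
        columns[(x - lowest) // interval_width].append(y)

    # The column mean is the least-squares prediction within each column
    means = {c: sum(ys) / len(ys) for c, ys in columns.items()}

    # Sum of squared discrepancies between observed Y and predicted Y'
    sum_sq = sum((y - means[c]) ** 2 for c, ys in columns.items() for y in ys)
    n = len(y_scores)
    se_est = math.sqrt(sum_sq / n)

    # Sigma of the whole Y distribution, for comparison
    grand_mean = sum(y_scores) / n
    sd_y = math.sqrt(sum((y - grand_mean) ** 2 for y in y_scores) / n)

    reduction_pct = 100 * (1 - se_est / sd_y)
    return means, se_est, sd_y, reduction_pct

# Hypothetical paired scores, for illustration only
x = [61, 63, 66, 68, 72, 74, 77, 78, 83, 84]
y = [104, 110, 103, 108, 112, 118, 113, 117, 115, 121]

means, se_est, sd_y, reduction_pct = predict_by_column_means(x, y, 5, 60)
print(means, round(se_est, 2), round(sd_y, 2), round(reduction_pct, 1))
```

With so few cases per column the column means are unstable, which is the same weakness the text notes for the inversions in Table 15.2.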

Predicting X from Y.—The predictions of X from Y are listed in Table
15.2 under $M_{row}$ in the next to the last column. The most probable X
score for any interval of Y scores is the mean of the row. The margin of
error of the predictions is given in each case by $\sigma_{row}$, and these appear
in the last column of Table 15.2. To complete the picture of these predictions
and their σ’s, Fig. 15.3 is presented. The standard error of estimate
of the X scores, $\sigma_{xy}$ (note the order of x and y in the subscript), is
equal to 5.93, the [...] cent) in predicting Y from X.¹

FIG. 15.3.—A chart showing the most probable score in test X for each midpoint
score in test Y; also the range between minus and plus one standard
deviation within each row.

The procedure for predictions by using means of columns and rows is
not used very much in practice. It was emphasized here because of the
principles it illustrates, principles that underlie the regression methods to
be described next. The reader will find that the main principle for making
predictions of measurements still holds: the principle of least squares.
He will also find that the principles for testing accuracy of prediction
(the standard error of estimate and the percentage of reduction of errors)
also still apply. New ways of estimating them will be shown and their
relation to the coefficient of correlation will be explained. In addition,
new ways of interpreting the usefulness of predictions will be demonstrated.


REGRESSION EQUATIONS 


The Meaning of a Regression Equation.—The main use of a regression 
equation is to predict the most likely measurement in one variable from 
the known measurement in another. If the correlation between Y and X 
were perfect (with a coefficient of +1.00 or —1.00), we could make pre- 
dictions of Y from X or of X from Y with maximum accuracy; the errors 
of prediction would be zero. If the correlation were zero, predictions
would be futile. Between these two limits, predictions are possible with
varying degrees of accuracy. The higher the correlation the greater the
accuracy of prediction and the smaller the errors of prediction.

When we use the means of columns of a scatter diagram as the most
probable corresponding Y values, we are actually predicting Y’s only for
the midpoints of intervals on X, or stated in another way, we are predicting
the same Y value for a certain range of values on X. If we have
any desire to be more accurate than that, we should like to be able to
make predictions for all values of X. This the regression line and the
regression equation enable us to do.

We found (see Figs. 15.2 and 15.3) that the means of the columns (and
of the rows) tended to lie along a straight line, with some minor deviations
from strict linearity. We shall now assume that the best predictions of
Y from X lie along a line that best fits the means of the columns when
those means are weighted according to the number of cases represented in
each one. This is known as the line of best fit, or the regression line.
When predicting X from Y, we have another such line for the regression
of X on Y. The two regression lines for the achievement-test data will
be found pictured in Fig. 15.4. (Only when a correlation is perfect will
the two lines coincide throughout their lengths. The higher the correlation,
plus or minus, the closer together they tend to lie.) All such pairs
of regression lines intersect at the point representing the means of Y
and X; in this case, they cross at X = 78.15 and Y = 115.28.

The Regression Equations and Regression Coefficients.—From elementary
algebra, the student should remember that the equation for a
straight line, in general form, is Y = a + bX. Such an equation completely
describes a line when a and b are known; they are the regression
coefficients and must be obtained from the data we have. Leaving out
of account for the moment the coefficient a, we should have Y = bX, or
Y equals b times X. We see from this that b is a ratio, and it tells us
how many units Y is increasing for every increase of one unit in X. If b
were 2, then for every unit of increase in X, Y increases two units. If
b = 0.5, then for every unit increase in X, Y increases a half unit. The
b coefficient gives us the slope of the regression line, and it depends upon
the coefficient of correlation and the two standard deviations, as in the
formula

$$b_{yx} = r_{yx}\left(\frac{\sigma_y}{\sigma_x}\right) \qquad \text{(Coefficient for linear regression of Y on X)} \tag{15.5}$$

where $b_{yx}$, with the subscripts in that order, implies that we are predicting
Y from X, and where this is also true for $r_{yx}$.¹

FIG. 15.4.—A scatter diagram for two examinations, with two regression lines represented
and their equations.

When we want to predict X from Y, we have a different regression
equation with a different b, which is given by the formula

$$b_{xy} = r_{xy}\left(\frac{\sigma_x}{\sigma_y}\right) \qquad \text{(Coefficient for linear regression of X on Y)} \tag{15.6}$$

The coefficient of correlation is, of course, numerically the same in both
cases, since $r_{yx} = r_{xy}$. But in each case, the b’s are different and are
equal to r times the ratio of the standard deviation of the predicted variable
to that of the variable predicted from. We frequently speak of the
predicted variable as the dependent variable and of the one predicted from
as the independent variable. The reason for this is that in predicting Y
from X, we arbitrarily take any value of X that we wish at the moment,
whereas the Y we predict from it is dependent upon what X we have
chosen. Once we have picked out a certain X, Y is immediately fixed by
our regression equation.

¹ For a derivation of formulas for finding regression coefficients, see Appendix A.

The regression coefficient a is merely a constant that we must always
add in order to assure that the mean of the predictions will equal the mean
of the obtained values. As $b_{yx}$ determines the slope of the line, $a_{yx}$ determines
the general level of the line. It is given by the formulas

$$a_{yx} = M_y - (M_x)b_{yx} \tag{15.7a}$$

$$a_{xy} = M_x - (M_y)b_{xy} \qquad \text{(The a coefficient in a linear regression equation)} \tag{15.7b}$$

where the first one concerns the equation for the regression of Y on X
and the second concerns the equation for the regression of X on Y.

The derivation of the entire regression equation is more often accomplished
by one composite formula, combining the derivations of a and b
into one operation as follows:

$$Y' = r\left(\frac{\sigma_y}{\sigma_x}\right)(X - M_x) + M_y \tag{15.8a}$$

or,

$$X' = r\left(\frac{\sigma_x}{\sigma_y}\right)(Y - M_y) + M_x \qquad \text{(Complete statement of linear regression equations)} \tag{15.8b}$$

We use Y′ and X′ here rather than Y and X to show that they are predicted
rather than obtained values. Predictions and obtained values
rarely coincide unless correlations are nearly perfect.
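Applied to the data of Table 15.2, we have

$$\begin{aligned}
Y' &= .61\left(\frac{7.85}{7.60}\right)(X - 78.15) + 115.28 \\
   &= (.61)(1.03)(X - 78.15) + 115.28 \\
   &= .630X - 49.23 + 115.28 \\
   &= .630X + 66.05
\end{aligned}$$

$$\begin{aligned}
X' &= .61\left(\frac{7.60}{7.85}\right)(Y - 115.28) + 78.15 \\
   &= .591Y + 10.02
\end{aligned}$$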


Interpreting these equations, we may say that Y′ increases .630 units
for every unit increase in X and that X′ increases .591 units for every
unit increase in Y. One way of checking the accuracy of the solution of
regression equations is to substitute $M_x$ in the first one to see whether
Y′ is the mean of the Y’s and to substitute $M_y$ in the second to see whether
we obtain $M_x$ as our prediction of X.
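The two regression equations and the substitution check can be reproduced from the summary statistics alone. The sketch below is illustrative; `regression_line` is a name introduced here, and the inputs (r = .61, the two means, and the two σ's) are those used in the text.

```python
def regression_line(r, sd_predicted, sd_known, mean_predicted, mean_known):
    """Return (a, b) for predicted = a + b * known, per formulas (15.5) and (15.7)."""
    b = r * sd_predicted / sd_known
    a = mean_predicted - mean_known * b
    return a, b

r = 0.61
mean_x, sd_x = 78.15, 7.60
mean_y, sd_y = 115.28, 7.85

a_yx, b_yx = regression_line(r, sd_y, sd_x, mean_y, mean_x)  # Y' = a_yx + b_yx * X
a_xy, b_xy = regression_line(r, sd_x, sd_y, mean_x, mean_y)  # X' = a_xy + b_xy * Y

# Check: substituting the mean of the known variable returns the other mean
y_at_mean_x = a_yx + b_yx * mean_x
x_at_mean_y = a_xy + b_xy * mean_y

print(round(b_yx, 3), round(a_yx, 2), round(b_xy, 3), round(a_xy, 2))
print(round(y_at_mean_x, 2), round(x_at_mean_y, 2))
```

Because the code keeps the b coefficients unrounded, the constants come out near 66.04 and 10.07 rather than the text's 66.05 and 10.02, which were obtained with b rounded to three places before multiplying.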

Another check as to the accuracy of computation of the b coefficients is
the equation

$$b_{yx}b_{xy} = r^2 \qquad \text{(Relation of regression coefficients to } r^2\text{)} \tag{15.9}$$

In other words, the product of the two b coefficients is equal to the square
of the coefficient of correlation. In this instance

$$(.630)(.591) = .3723 \approx .61^2$$

The Concept of Regression.—It may help in understanding the regression
equations as given in formulas (15.8a) and (15.8b) to take a glance
at their origin. The idea of regression came first and the correlation
method followed. It began with Sir Francis Galton, who was making
[...] the theories of evolution [...] of their parents, he [...] diagram; perhaps the
first. In order to put parents and their children on a common measuring
scale, he converted all heights to standard scores. As the reader already
knows, this meant expressing each person’s height as a ratio of his deviation
from his group mean to the standard deviation of that group dispersion.
The unit for the offspring’s scale and also for the parents’ scale
was then one σ. Figure 15.5 shows the type of figure Galton drew.

Galton next computed the means of offspring’s heights (in z scores)
corresponding to certain fixed [...] (in z scores). As we saw [...], he found that [...]
along a straight-line trend. To him, incidental[...]
did not increase [...] as did the parents’ heights. [...]
from their general mean than [...] came deviated from their [...]



Origin of the Coefficient of Correlation.—Galton wanted a single value
which would express the amount of this regression phenomenon in any
particular relationship problem. Karl Pearson solved the problem in
terms of the formula to which his name is attached. The steps were
somewhat as follows. Galton’s own idea was to use the slope of the
regression line as the index of relationship, because the steeper the slope,
the closer the agreement between two variables. The slope of the regression
line in Fig. 15.5, as in any coordinate plot, is the ratio of the increase
in Y corresponding to a certain increase in X. From the plot we see
that as X changes 2σ (from the mean to +2σ, as shown), Y changes
only 1σ. The slope is 1/2 or .5. This was Galton’s coefficient of regression,
which received the symbol r for that reason. That symbol has
remained. The Pearson r is the slope of the regression line when both
Y and X are measured in standard-deviation units. In this case, it can
be shown that

$$r_{yx} = \frac{\Sigma z_y z_x}{N} \qquad \text{(Pearson r from standard measures)} \tag{15.10}$$

In other words, r is an average of all the cross products of standard
measures.

FIG. 15.5.—Diagram showing the relation of the Pearson product-moment coefficient of
correlation to the slope of the regression line when scores in both X and Y are in
standard-score units. (Axes: height of parent, height of offspring.)
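Formula (15.10) and the statement that r is the slope of the regression line in standard-score units can be checked numerically. The sketch below uses hypothetical heights (not Galton's data), and all function and variable names are assumptions introduced for illustration.

```python
import math

def z_scores(values):
    """Convert raw values to standard scores using the sample sigma (divisor N)."""
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    return [(v - mean) / sd for v in values]

def pearson_r(xs, ys):
    """Formula (15.10): the mean cross product of standard scores."""
    zx, zy = z_scores(xs), z_scores(ys)
    return sum(a * b for a, b in zip(zx, zy)) / len(xs)

def least_squares_slope(xs, ys):
    """Slope of the least-squares line for predicting ys from xs."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    numerator = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    denominator = sum((x - mx) ** 2 for x in xs)
    return numerator / denominator

# Hypothetical heights in inches, for illustration only
parent = [64, 66, 67, 68, 70, 71, 73]
child = [66, 65, 68, 67, 69, 68, 71]

r = pearson_r(parent, child)
slope_in_z_units = least_squares_slope(z_scores(parent), z_scores(child))
print(round(r, 3), round(slope_in_z_units, 3))
```

The two printed values agree, since in standard-score units the sum of squared z values equals N and the slope reduces to the mean cross product.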
Derivation of the Regression Equations.—Since r is the slope of the
regression line when standard measures are used, the equation for this
situation is

$$z'_y = r_{yx}z_x \qquad \text{(Regression equation with standard measures)} \tag{15.11}$$

Here we use $z_y$ with the prime to denote a predicted value as distinguished
from the actual value. From this beginning, let us work toward
the regression equations in raw-score form (formulas 15.8a and 15.8b).
The next step is to express these standard measures as deviations, y′ and x.
Since $z_x = x/\sigma_x$ and $z'_y = y'/\sigma_y$ ($\sigma_y$ is the unit of the $z'_y$ values as well as
of the $z_y$ values), the equation becomes

$$\frac{y'}{\sigma_y} = r_{yx}\left(\frac{x}{\sigma_x}\right) \tag{15.12}$$

If we multiply this equation through by $\sigma_y$, we have

$$y' = r_{yx}\left(\frac{\sigma_y}{\sigma_x}\right)x \qquad \text{(Regression equation with deviation scores)} \tag{15.13a}$$

or

$$y' = b_{yx}x \tag{15.13b}$$

Equation (15.13b) shows that the same b coefficient applies to deviation
scores as that applying to raw scores (see formula 15.8a). It also shows
that since the means of x and y are zero, the regression lines will pass
through both of them without having an a coefficient in the equation.
One more step is needed to arrive at the raw-score type of regression
equations. Going back to equation (15.12), if we next convert x to its
equivalent, X − $M_x$, and y′ to its equivalent, Y′ − $M_y$ ($M_y$ is the mean
of the Y′ values as well as of the Y values), we have

$$\frac{Y' - M_y}{\sigma_y} = r_{yx}\left(\frac{X - M_x}{\sigma_x}\right) \tag{15.14}$$

Multiplying through by $\sigma_y$, we have

$$Y' - M_y = r_{yx}\left(\frac{\sigma_y}{\sigma_x}\right)(X - M_x)$$

And transposing $M_y$,

$$Y' = r_{yx}\left(\frac{\sigma_y}{\sigma_x}\right)(X - M_x) + M_y$$

which is identical with formula (15.8a).


Regression Coefficients from Ungrouped Data.—When data have not
been grouped in class intervals, the derivation of the b coefficient requires
another formula, which reads

$$b_{yx} = \frac{N\Sigma XY - (\Sigma X)(\Sigma Y)}{N\Sigma X^2 - (\Sigma X)^2} \qquad \text{(Regression coefficient directly from data)} \tag{15.15}$$

When this formula is applied to the data in Table 8.3, we have

$$b_{yx} = \frac{4{,}720 - 4{,}550}{6{,}240 - 4{,}900} = \frac{170}{1{,}340} = .127$$

The a coefficient is obtained by means of formula (15.7a) and is solved as
follows:

$$a_{yx} = 6.5 - (7.0)(.127) = 6.5 - .89 = 5.61$$

The regression equation is therefore Y′ = 5.61 + .127X. The equation
for the regression of X on Y can be obtained by similar operations,
substituting Y for X, and vice versa, in formula (15.15). The solution for
the illustrative problem is

$$b_{xy} = \frac{4{,}720 - 4{,}550}{5{,}330 - 4{,}225} = \frac{170}{1{,}105} = .154$$

$$a_{xy} = 7.0 - (6.5)(.154) = 7.0 - 1.00 = 6.00$$

Checking the b coefficients, $b_{yx}b_{xy}$ = (.127)(.154) = .0196 = r², which is
in agreement with r² as previously known (see Table 8.3).
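Formula (15.15) lends itself to a direct computation from the raw sums. In the sketch below the individual sums (N = 10, ΣX = 70, ΣY = 65, ΣXY = 472, ΣX² = 624, ΣY² = 533) are not quoted from Table 8.3; they are inferred from the products and means shown in the worked solution and should be read as assumptions. The function name is likewise introduced here for illustration.

```python
def regression_from_sums(n, sum_known, sum_predicted, sum_cross, sum_known_sq):
    """Formulas (15.15) and (15.7): (a, b) for predicting one variable from the other."""
    b = (n * sum_cross - sum_known * sum_predicted) / (n * sum_known_sq - sum_known ** 2)
    a = sum_predicted / n - (sum_known / n) * b
    return a, b

# Sums inferred from the worked example (assumed, not copied from Table 8.3)
n = 10
sum_x, sum_y = 70, 65
sum_xy = 472
sum_x2, sum_y2 = 624, 533

a_yx, b_yx = regression_from_sums(n, sum_x, sum_y, sum_xy, sum_x2)  # Y on X
a_xy, b_xy = regression_from_sums(n, sum_y, sum_x, sum_xy, sum_y2)  # X on Y

r_squared = b_yx * b_xy  # formula (15.9)
print(round(b_yx, 3), round(a_yx, 2), round(b_xy, 3), round(a_xy, 2), round(r_squared, 4))
```

Swapping the roles of the two variables in the call gives the second regression line, which mirrors the substitution of Y for X described in the text.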

Predictions from Regression Equations.—As an illustration of how a
regression equation is applied in prediction, let us assume some values
of X and find the corresponding Y′ values. Because in the preceding
methods of prediction we predicted Y’s corresponding to midpoints of the
intervals of X, let us do the same here for the sake of comparison, remembering
that we might have chosen any values of X that we pleased.
Table 15.4 gives the X values and their corresponding Y′ values. When
X is 62, Y′ is 105.1, and when X = 97, Y′ = 127.2, etc. It is interesting
to compare these particular predictions with the means of the columns,
which are given in the third row of Table 15.4. The discrepancies will be



TABLE 15.4.—PREDICTIONS OF Y FROM X AND X FROM Y BY MEANS OF REGRESSION EQUATIONS*

Y′ = 0.630X + 66.05

X           62     67     72     77     82     87     92     97
Y′       105.1  108.3  111.4  114.6  117.7  120.9  124.0  127.2
M_c      107.0  105.5  114.9  114.5  116.4  120.3  124.0  137.0

X′ = 0.591Y + 10.02

Y           97    102    107    112    117    122    127    132    137
X′        67.3   70.3   73.3   76.2   79.2   82.1   85.1   88.0   91.0
M_row     67.0   70.3   74.0   75.9   78.6   83.2   85.8   83.7   97.0

* The data involved are from the two examinations correlated in Table 8.5. The means of the
columns and rows are obtained from Table 15.2.


found very small as a rule. Granting that the column means are generally
not very reliable because of small samples, we may feel more assurance in
the Y′ predictions because they are determined from the trend of the
entire data rather than by small samples in separate columns. The predictions
of X′ from Y are given in the second section of Table 15.4 and
are compared with the means of the [...]

As a practical means of [...] most suitable procedure. If the regression
lines are drawn as in Fig. 15.4 [...] on the base line, one can follow [...]
regressions. Another point for the [...] = 60, Y = 103.85; and a third point,
X = 100 and Y = 129.05. For the [...] points might be located conveniently
at Y = 100, X = 69.12, and Y = 130, X = 86.85.

[...] discrepancies between observed values and predicted [...] There we
computed the standard error of the estimate from the discrepancies
themselves; here we estimate the margin of error of prediction, as [...]
basis of regression [...] from the coefficient of correlation. The formulas are


$$\sigma_{yx} = \sigma_y\sqrt{1 - r_{yx}^2} \tag{15.16a}$$

and

$$\sigma_{xy} = \sigma_x\sqrt{1 - r_{xy}^2} \qquad \text{(Standard error of estimate computed from r)} \tag{15.16b}$$

in both of which the terms are now well known. It will be seen that the
two equations are the same except for the use of $\sigma_y$ when we are predicting
Y and of $\sigma_x$ when we are predicting X (for $r_{yx} = r_{xy}$). The two standard
deviations are multiplied by the common factor $\sqrt{1 - r^2}$. This factor is
less than 1.00 whenever r is not zero and gives us an estimate of the reduction
in errors of prediction from knowledge of correlated measurements as compared
to errors of prediction without that knowledge. When r is zero, this element equals
1.00, and then $\sigma_{yx} = \sigma_y$ and $\sigma_{xy} = \sigma_x$. In other words, when r = 0, there
is no basis for prediction. When r = 1.0 (or −1.0) the element reduces
to zero, and so does the standard error of estimate. This coincides with
the expectation that the margin of error of prediction is zero when the
correlation is perfect.

Interpretation of an Obtained Standard Error of Estimate.—The interpretation
of the standard error of the estimate when r is neither zero nor
1.00 is somewhat as follows. Like any standard deviation, $\sigma_{yx}$ can be
referred to the normal curve of distribution. For the examination
problem,

$$\sigma_{yx} = 7.85\sqrt{1 - .3721} = 6.22$$

and

$$\sigma_{xy} = 7.60\sqrt{1 - .3721} = 6.02$$

No matter in what part of the measuring scale we are predicting (within
the range of obtained scores, naturally) we assume that the margin of
error is the same. When we predict Y from X, the average dispersion
of observed measurements about Y′ is given by a σ of 6.22. We expect
two-thirds of the observed cases to lie within the limits of plus or minus
6.22 from Y′. This situation is illustrated graphically in Fig. 15.6.
There we have the regression line, along which the predicted Y’s lie,
and in dotted lines we have the limits of one $\sigma_{yx}$ on either side of it. Had
we plotted a point for every individual, we should have expected about
two-thirds of them to fall between the two dotted lines. To make a particular
prediction, when X = 90, Y′ = 122.8. The odds are 2 to 1 that
any individual whose X score is 90 will not fall below 116.6 or go above
129.0. We could state other odds for a divergence of 2σ either way or
any other distance. It all depends upon our purposes.
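The computation of the standard error of estimate from r, and the one-σ band around a particular prediction, can be sketched as follows. The function name and variable names are assumptions made for illustration; the numerical inputs (r = .61, σ's of 7.85 and 7.60, and the equation Y′ = .630X + 66.05) come from the text.

```python
import math

def se_estimate_from_r(sd_predicted, r):
    """Formula (15.16): sigma of the predicted variable times sqrt(1 - r**2)."""
    return sd_predicted * math.sqrt(1 - r ** 2)

r = 0.61
se_yx = se_estimate_from_r(7.85, r)   # predicting Y from X
se_xy = se_estimate_from_r(7.60, r)   # predicting X from Y

# One-sigma band around the prediction for X = 90
y_pred = 0.630 * 90 + 66.05
lower, upper = y_pred - se_yx, y_pred + se_yx

print(round(se_yx, 2), round(se_xy, 2))
print(round(y_pred, 2), round(lower, 1), round(upper, 1))
```

The band limits differ from the text's 116.6 and 129.0 only in the last decimal, because the text rounds the prediction to 122.8 before adding and subtracting 6.22.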




We could prepare a similar diagram showing the limits of the middle
two-thirds of the individuals about the regression of X on Y, and we
could interpret the errors of prediction in a similar manner. It will be
noted that the margin of error as given by $\sigma_{xy}$ is 6.02, or 0.2 smaller in
predicting in the other direction, i.e., X from Y, but this is merely because
$\sigma_x$ is smaller than $\sigma_y$. The percentage of error is the same in the two cases.
The ratio of $\sigma_{yx}$ to $\sigma_y$ is exactly the same as the ratio of $\sigma_{xy}$ to $\sigma_x$, and
that ratio is given by the factor $\sqrt{1 - r^2}$. This factor we will meet again
with a name attached to it (see formula 15.21).


FIG. 15.6.—The regression of Y upon X, showing the range of observed values
expected in Y in separate categories of scores on X. Parallel dotted lines above and
below the regression line at a vertical distance of one standard error of the estimate each
[...] we expect two-thirds of the observed values to be. (Scores in test X on the base line.)

[...] the data that the sum of [...] the deviations of observed points
from the line is a minimum. Other lines [...] squares. It is reasonable
that if the line is a mean, the deviations from it should be measured by
a standard deviation. That standard deviation is the standard error of
estimate.

[...] 1942, 7, 85-102.




Correction of a Standard Error of Estimate for Bias.—In smaller
samples (N is less than 50) it would be well to make a correction in $\sigma_{yx}$
(or $\sigma_{xy}$) before applying it to the population. The change can be made
by the formula

$${}_{c}\sigma_{yx} = \sigma_{yx}\sqrt{\frac{N}{N - 2}} \qquad \text{(Correcting } \sigma_{yx} \text{ for bias)} \tag{15.17}$$

where N is the number in the sample. The correcting can be done as
well in the original computation, as follows:

$${}_{c}\sigma_{yx} = \sigma_y\sqrt{(1 - r^2)\left(\frac{N}{N - 2}\right)} \tag{15.18}$$

The Reliability of a Regression Coefficient.—The b coefficient in the
regression equation has its sampling error, like all statistics. This is
estimated by

$$\sigma_{b_{yx}} = \frac{\sigma_{yx}}{\sigma_x\sqrt{N}} \tag{15.19}$$

or by

$$\sigma_{b_{yx}} = \frac{\sigma_y}{\sigma_x}\sqrt{\frac{1 - r^2}{N}} \qquad \text{(Standard error of a regression coefficient)} \tag{15.20}$$

The $\sigma_{b_{xy}}$ would be the same, except for changing the x and y subscripts
around. For our examination problem

$$\sigma_{b_{yx}} = \frac{6.22}{(7.60)(9.3274)} = \frac{6.22}{70.88824} = .088$$

We may say that the odds are 2 to 1 that the obtained $b_{yx}$ of .63 does not
deviate from the population $b_{yx}$ by more than .088. There is very little
chance that the true b coefficient here is zero.
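The worked figure of .088 can be reproduced in a few lines. This sketch follows the arithmetic of the example above (the standard error of estimate divided by the product of $\sigma_x$ and the square root of N, with N = 87 since 9.3274 is its square root); the function name and the ratio printed at the end are additions made for illustration.

```python
import math

def se_of_slope(se_estimate, sd_known, n):
    """Standard error of b: se of estimate over (sigma of known variable * sqrt(N))."""
    return se_estimate / (sd_known * math.sqrt(n))

se_b = se_of_slope(6.22, 7.60, 87)

# How many standard errors the obtained b of .63 lies from zero
ratio = 0.63 / se_b

print(round(se_b, 3), round(ratio, 1))
```

A ratio of this size is what lies behind the remark that there is very little chance that the true b coefficient is zero.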


THE CORRELATION COEFFICIENT AND ACCURACY OF PREDICTION


The chief index of goodness of prediction of measurements thus far in
this discussion has been the standard error of estimate. It has been
shown how the latter is closely related to the coefficient of correlation.
As r increases, the standard error of estimate decreases. There are other
ways in which r and some of its derivatives can be used to indicate accuracy
of prediction. Three of the common derivatives are the coefficient of
alienation, the index of forecasting efficiency, and the coefficient of
determination. Each has its unique story to tell about the closeness of
correlation between two things and about the utility of predictions.

The Coefficient of Alienation.—Whereas r indicates the strength of
relationship, the coefficient of alienation, k, indicates the degree of lack of
relationship. By formula,

$$k = \sqrt{1 - r^2} \qquad \text{(Coefficient of alienation computed from r)} \tag{15.21}$$

FIG. 15.7.—Chart showing k (coefficient of alienation) and d (coefficient of determination)
as functions of r (coefficient of correlation).

Squaring both sides of this equation, we have

$$k^2 = 1 - r^2$$

And transposing, we have

$$k^2 + r^2 = 1.00$$

Thus, although we might have expected k plus r to equal 1.00, it is rather
the sum of their squares that equals 1.00. If r is .50, k is not also .50
but .866. When r is .50, then, the degree of relationship is less than the
degree of lack of relationship. It is when r = .7071 that relationship
and lack of relationship are equal, for k also then equals .7071. Then
r² + k² = .50 + .50 = 1.00. Other values of k for different sizes of r
can be found in Table 15.5. Figure 15.7 shows pictorially the functional
relationship between k and r. Students of mathematics will recognize



TABLE 15.5.—INDICATORS OF THE IMPORTANCE OF COEFFICIENTS OF CORRELATION

  r_xy     k_xy              100(1 - k_xy)                100 r_xy^2
           Coefficient of    Percentage reduction in      Percentage of
           alienation        errors of prediction of      variance accounted
                             Y from X                     for

  .00      1.000              0.0                          0.00
  .05       .999              0.1                          0.25
  .10       .995              0.5                          1.00
  .15       .989              1.1                          2.25
  .20       .980              2.0                          4.00
  .25       .968              3.2                          6.25
  .30       .954              4.6                          9.00
  .35       .937              6.3                         12.25
  .40       .917              8.3                         16.00
  .45       .893             10.7                         20.25
  .50       .866             13.4                         25.00
  .55       .835             16.5                         30.25
  .60       .800             20.0                         36.00
  .65       .760             24.0                         42.25
  .70       .714             28.6                         49.00
  .75       .661             33.9                         56.25
  .80       .600             40.0                         64.00
  .85       .527             47.3                         72.25
  .90       .436             56.4                         81.00
  .95       .312             68.8                         90.25
  .98       .199             80.1                         96.00
  .99       .141             85.9                         98.00
  .995      .100             90.0                         99.00
  .999      .045             95.5                         99.80




the relationship $r^2 + k^2 = 1.00$ as the equation for a circle with a radius
of 1.00. The diagram shows only positive values of r and k.¹
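The relation between r and k described here can be tried out numerically. The following is a minimal Python sketch of formula (15.21); the function name `alienation` is ours and is not part of the text.

```python
import math

def alienation(r):
    """Coefficient of alienation, k = sqrt(1 - r^2), as in formula (15.21)."""
    if not -1.0 <= r <= 1.0:
        raise ValueError("r must lie between -1 and 1")
    return math.sqrt(1.0 - r * r)

# For each r, show k and confirm that the squares sum to 1.00.
for r in (0.30, 0.50, 0.7071, 0.90):
    k = alienation(r)
    print(f"r = {r:.4f}   k = {k:.4f}   r^2 + k^2 = {r * r + k * k:.4f}")
```

Running the loop shows k exceeding r until r reaches about .7071, after which r is the larger of the two.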

Sometimes we wish to stress the point of independence between two
things rather than their closeness of agreement. In such instances, we
present k as well as r. Besides being related to r, k is also related to other
indices of goodness of prediction to be mentioned next.


1 The relation of k to r is the same as that of the sine of an angle to the cosine of that
angle. Values of k corresponding to known values of r can be found by using Table J
in Appendix B.




The Index of Forecasting Efficiency.—In the formula for the SE of the
estimate, $\sigma_{yx} = \sigma_y \sqrt{1 - r_{yx}^2}$, we can now see that the radical
term, $\sqrt{1 - r_{yx}^2}$, is really the coefficient of alienation. We could
rewrite the formula as $\sigma_{yx} = \sigma_y k_{yx}$. If we were to multiply k by 100,
we should have the percentage $\sigma_{yx}$ is of $\sigma_y$. When r = .61, as in our
recent illustration, k = .7924. The SE of the estimate in this problem is
79.24 per cent of the observed dispersion of observations. Our margin of
error in predicting Y with knowledge of X scores is about 79 per cent as
great as the margin of error we would make without knowledge of X scores.
For then we predict every Y to be the mean of the Y's, and the SE of the
prediction then equals $\sigma_y$. The reduction of our margin of error is 100
minus 79.24, or 20.76 per cent. The index of forecasting efficiency is
defined as the percentage reduction in errors of prediction by reason of
correlation between two variables. The general, simplified formula is


$$E = 100\left(1 - \sqrt{1 - r^2}\right)$$ (Index of forecasting efficiency) (15.22)

or

$$E = 100(1 - k)$$
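A short Python sketch of formula (15.22) may help in checking values of E. The function name `forecasting_efficiency` is ours; the values of r tried are the one from the recent illustration and two from Table 15.5.

```python
import math

def forecasting_efficiency(r):
    """Index of forecasting efficiency, E = 100(1 - sqrt(1 - r^2)), formula (15.22)."""
    if not -1.0 <= r <= 1.0:
        raise ValueError("r must lie between -1 and 1")
    return 100.0 * (1.0 - math.sqrt(1.0 - r * r))

# E for the illustrative r of .61 and for two entries of Table 15.5.
for r in (0.30, 0.61, 0.80):
    print(f"r = {r:.2f}   E = {forecasting_efficiency(r):.2f} per cent")
```

The sketch makes plain how slowly E rises with r: the correlation must reach .80 before errors of prediction are reduced by 40 per cent.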


[...], this does not seem like much of a gain. There are situations,
however, in which, as will be shown later, a gain of even less might [...]

Better tests, with validity coefficients [...] and still better tests, when
r = .75 [...] may also seem small, we must treat them in a [...] It is
probable that [...]ments in psychological and educational practices. Tests
rarely show correlations greater than .8 with practical criteria, and those
correlating less than .3 are usually of limited value when used alone. In a
battery to which they make a unique contribution it may still be worth while
to use them. The corresponding limits on the scale of E are 4.6 and 40.
The Coefficient of Determination.—Another mode of interpretation of
r is in terms of $r^2$, which is called the coefficient of determination. This
statistic is also sometimes symbolized as d. The coefficient $r^2$ gives us
(when multiplied by 100) the percentage of the variance (see Ch. 5) in Y
that is associated with or determined by variance in X. When r = .50,
the percentage of the variance in Y that is accounted for by variance in
X is 25, or one-fourth. To account for half the variance of any set of
measurements, the r with another variable would have to be .7071. The


Fig. 15.8.—E (index of forecasting efficiency) as a function of r
(coefficient of correlation).


proportion of the variance in Y not determined by or associated with
variance in X is given by $k^2$, which is called the coefficient of
nondetermination. These statements about determination of Y by X are
reversible and apply equally well to determination of X by Y. We should
speak of determination of one thing by another, however, only when a
causal relationship can be logically defended; otherwise the expression
associated with or accounted for (by way of prediction) is better. In
Table 15.5, several of the $100r^2$ values are given for corresponding r's.
In Fig. 15.7 is presented graphically the functional relationship between
d and r.
Predicted and Nonpredicted Variances.—The coefficient of determination,
as well as its relations to r, k, and other statistics, can best be clarified by
introducing another new idea. The total amount of variance in the predicted
variable, Y, we denote by $\sigma_y^2$. We can think of this variance as
being broken down into two independent components, the predicted and
the nonpredicted portions. The predictions of Y, which we have called
Y', have their dispersion and their variance, which are denoted by $\sigma_{y'}$ and
$\sigma_{y'}^2$, respectively. The standard deviation $\sigma_{y'}$ would be computed from
the deviations of the predicted values about the mean of the Y values, $M_y$.

The amount of nonpredicted variance is indicated by the square of the
standard error of estimate ($\sigma_{yx}^2$). This statistic is computed from the
deviations of the obtained Y values from the regression line (or from the
predicted Y values). The two component variances of $\sigma_y^2$ are therefore


$$\sigma_y^2 = \sigma_{y'}^2 + \sigma_{yx}^2$$ (Component variances in the predicted variable) (15.23)


If we divide this equation through by $\sigma_y^2$, we will have everything in
terms of proportions.


$$\frac{\sigma_y^2}{\sigma_y^2} = \frac{\sigma_{y'}^2}{\sigma_y^2} + \frac{\sigma_{yx}^2}{\sigma_y^2} = 1.0$$ (Total variance as the sum of two proportions) (15.24)


The first term on the right, $\sigma_{y'}^2/\sigma_y^2$, is the proportion of the variance in Y
that is predicted and the second term is the proportion of the variance
that is not predicted. We have already defined $r^2$ as the proportion of
predicted variance and $k^2$ as the proportion of nonpredicted variance.
This means that $r^2$ equals $\sigma_{y'}^2/\sigma_y^2$ and that $k^2$ equals $\sigma_{yx}^2/\sigma_y^2$, and that
$r = \sigma_{y'}/\sigma_y$ and $k = \sigma_{yx}/\sigma_y$.


We therefore have some new concepts of r and k. We can say that r is the
ratio of the dispersion of predicted values to the dispersion of obtained
values and that k is the ratio of the dispersion of errors to the dispersion
of obtained values.
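The breakdown of variance into predicted and nonpredicted portions can be verified on a small set of scores. The Python sketch below uses eight hypothetical pairs of X and Y values invented for illustration only, and computes all variances with N in the denominator, which is an assumption about the convention intended here.

```python
from statistics import mean, pstdev, pvariance
import math

# Hypothetical paired scores, invented for illustration only.
x = [2, 4, 5, 7, 8, 10, 11, 13]
y = [3, 6, 4, 8, 7, 11, 9, 12]

n = len(x)
mx, my = mean(x), mean(y)
sx, sy = pstdev(x), pstdev(y)

# Product-moment correlation and the regression of Y on X.
r = sum((a - mx) * (b - my) for a, b in zip(x, y)) / (n * sx * sy)
b_yx = r * sy / sx
y_pred = [my + b_yx * (a - mx) for a in x]        # the Y' values
errors = [b - p for b, p in zip(y, y_pred)]       # obtained minus predicted

var_y = pvariance(y)                              # total variance
var_pred = pvariance(y_pred)                      # predicted variance
var_err = sum(e * e for e in errors) / n          # nonpredicted variance

print(f"total = {var_y:.4f}  predicted + nonpredicted = {var_pred + var_err:.4f}")
print(f"r = {r:.4f}  ratio of dispersions = {math.sqrt(var_pred / var_y):.4f}")
print(f"k = {math.sqrt(1 - r * r):.4f}  ratio of error dispersion = {math.sqrt(var_err / var_y):.4f}")
```

Each printed line pairs a quantity with the ratio that should equal it according to formulas (15.23) and (15.24).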


EFFECTIVENESS OF SELECTION TESTS 

Although the coefficient of correlation [...], are all accurate and
meaningful [...] predictions, and they serve well [...]xpert may find it [...]
conclusions in other terms. This is true, particularly, when we are dealing
with selection tests.

Those concerned with the administrative problems of selecting personnel 
by means of tests find that a different kind of enlightenment is desirable 
than that provided by the statistics in question. It is one thing to know 
that by the use of this test score, or this composite score, errors of
prediction are reduced 15 per cent. But what does this mean with regard
to the number of applicants one must examine, and what proportion one
must accept for training in order to have a certain number of successful
employees at the end of training? With the same number of applicants
selected, how many more satisfactory ones will we have with the aid of
the selection test than we would have had without it? Even if we could
get the employer to grasp the idea of the index of forecasting efficiency
as an abstract indicator of amount of gain achieved by the test, to most
laymen the E values actually reached by most test procedures sound very
unimpressive because laymen generally lack the proper experience to 
evaluate them. For these reasons, several suggestions have been made 
in recent years for more realistic and fruitful ways of evaluating selection 
tests. One of these will be described in some detail and the others men- 
tioned in principle. 

Determiners of Effective Selection—Everything else being equal, 
validity coefficients (and statistics derived from them) are accurate 
indices of the effectiveness of selection tests. It has been pointed out, 
however, that the correlation of a test with a practical criterion is not 
the only thing to be considered when practical decisions must be made. 
The practical utility of tests in any training or job situation depends upon 
other factors than the validity of the test or test battery. It depends 
upon the percentage of employees who would have succeeded if testing 
had not been applied in selection. It also depends upon the percentage 
of the applicants who are selected by means of the tests. 

The Taylor-Russell Method.—Taylor and Russell have rationalized the
problem in a clear manner.¹ Following their exposition of the matter,
the selection situation with tests is described in Fig. 15.9. The X axis
represents the scale of test scores and the vertical axis represents the scale
of the training or job criterion. Let us assume that the correlation
between test and criterion is about .50. The ellipse describes the
dispersion of individuals in this two-dimensional surface. On the X scale a
point $X_c$ is marked. This is an arbitrary critical or qualifying score on
the test. Individuals with scores above $X_c$ are selected and those with
scores below $X_c$ are rejected.

1 Taylor, H. C., and Russell, J. T. The relationship of validity coefficients to the
practical effectiveness of tests in selection. J. appl. Psychol., 1939, 23, 565-578.

Without selection on the basis of the test, a certain percentage of the
accepted applicants would have succeeded. We assume a continuous
variable for the criterion as well as for the test. The point $Y_c$ is an
arbitrary critical criterion value above which the verdict is success and
below which the verdict is failure. By drawing lines at $X_c$ and $Y_c$ parallel
to axes Y and X, respectively, we divide the population into four kinds of
individuals defined as follows:¹

A—Individuals who if selected would succeed.

B—Individuals who would be rejected but who if allowed to compete
would succeed.

C—Individuals who if selected would fail.

D—Individuals who would be rejected and who if allowed to compete
would fail.


[Figure 15.9. Regions of the diagram: B, rejected but would have succeeded;
A, selected and succeeded; D, rejected and would have failed; C, selected and
failed. Horizontal axis: test score, divided at $X_c$ into rejected and selected.]

Fig. 15.9.—Correlation surface subdivided by a critical score ($X_c$) which separates the
population into selected and rejected groups of individuals on the basis of test results, and
by a critical criterion value ($Y_c$) which separates the same population into successful
and unsuccessful in a job assignment.


[...] scale. In attribute-prediction [...] fixed by the nature of things.

1 The letter symbols—A, B, C, D—are de[...] and Russell. Here they have been made
[...] categories—a, b, c, d—in the usual 2 × 2 contingency table.


We are now ready to consider two new concepts proposed by Taylor 
and Russell. One is the success ratio and the other the selection ratio. 
The success ratio is the proportion of accepted candidates who would be 
successful. There would be a certain success ratio without the use of 
selection tests, and another success ratio with the use of tests, provided 
the tests have any validity at all, and provided some selection occurs. 
The selection ratio is the proportion of all applicants examined who are 
accepted. In terms of symbols and equations, the success ratio without 
the use of tests is 


$$S_0 = \frac{A + B}{A + B + C + D} = \frac{A + B}{N}$$ (Success ratio without the use of selection tests) (15.25)


where letters A, B, C, and D, are defined as in Fig. 15.9. When there 
has been selection on the basis of a valid test, 


$$S_1 = \frac{A}{A + C}$$ (Success ratio with the use of tests) (15.26)


The selection ratio is 


$$p_s = \frac{A + C}{A + B + C + D} = \frac{A + C}{N}$$ (Selection ratio) (15.27)
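The three ratios just defined can be computed directly from the four frequencies. The Python sketch below is only an illustration: the function name `selection_summary` is ours and the counts used are hypothetical.

```python
def selection_summary(A, B, C, D):
    """Success and selection ratios from the four categories of Fig. 15.9.

    A: selected and succeed       B: rejected but would succeed
    C: selected and fail          D: rejected and would fail
    """
    N = A + B + C + D
    if N == 0 or A + C == 0:
        raise ValueError("need at least one applicant and one selected individual")
    return {
        "S0": (A + B) / N,               # formula (15.25), success ratio without tests
        "S1": A / (A + C),               # formula (15.26), success ratio with tests
        "selection_ratio": (A + C) / N,  # formula (15.27)
    }

# Hypothetical counts for 100 applicants, invented for illustration only.
print(selection_summary(A=30, B=10, C=20, D=40))
```

With these hypothetical counts, 40 per cent of all applicants would succeed without testing, whereas 60 per cent of the selected half succeed.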


Favorable Success Ratios (before Selection).—A few examples will illustrate
the fact that effectiveness of selection by tests depends upon the
success ratio that would prevail without that selection. It is obvious 
that if all trainees or employees would be satisfactory without the use of 
selection tests, there would be little excuse for using them. The chances 
of improving matters by this approach would be nil, except as the quality 
or average production of satisfactory personnel were raised as a result. 
When the success ratio without tests is very low, there is much room for 
improvement, and with valid tests some improvement is bound to occur. 

Consider Fig. 15.10 in this connection. There four special situations 
are shown; cases of high and low test validity combined with high and 
low success ratios. In diagram I, the success ratio is high. One could 
move the critical score over a considerable range without changing the 
success ratio very much, until the selection ratio became very small. In 
diagram II, even where the correlation is high, a change in the cut-off
score would disqualify very few potential failures, and eliminating even a
few would result in losing many more potentially successful candidates.
In diagrams III and IV, the success ratios are very small. In diagram
III, even a small number of rejections would disqualify many potential
failures with little or no loss of potential successes. This is even more
true where the validity of the test is much higher, as shown in diagram IV.
In general, then, we stand to gain most when success ratios without testing
are small.

Favorable Selection Ratios.—If the number of applicants relative to the
number of places to fill is small there is, of course, not much opportunity 
for selection. In the limiting case, if no one could be rejected there 


[Figure 15.10. Diagrams I to IV, each with job proficiency on the vertical axis
and test score on the horizontal axis, with the critical score $X_c$ marked.]

Fig. 15.10.—Diagrams similar to that in Fig. 15.9, with different combinations of selection
ratio and success ratio (for definition of these ratios see the text), and different degrees
of validity.


would be no use of selection tests. On the other hand, if there are many
more applicants than places, and if one can then skim the “cream” off
the top of the applying group, the chances of improving the quality of
accepted personnel would seem to be great. This presupposes a method
whereby the “cream” can be properly recognized [...] that. But how valid
must a test be before t[...]

Figure 15.10, diagram III, shows that even a test of rather low validity
may be effective in skimming the “cream” in a negative way. That is,


it can do much to eliminate failures. One could move the critical score a 
considerable distance and still reject several times as many potential 
failures as he would lose among potential successes. From diagram I, 
however, we get the suggestion of a warning that if the qualifying score is 
set too high in a test of low validity we may be losing some of the very 
best qualified. We cannot press refined decisions of this kind too far on 
the basis of these diagrams because the populations are not uniformly 
distributed throughout the elliptical areas; they thin out around the 
margins. The general tendencies, however, should be clear. 

It is clear from what has been said above that a test of low validity 
may be very useful in selection under a favorable set of conditions. Those 
conditions include certain combinations of success ratios and selection 
ratios. It can also be seen that even a test of high validity may be of 
little or no value if the conditions are unfavorable. Consider diagram II, 
in which the success ratio is very high. One could not eliminate many 
potential failures without losing many more satisfactory personnel. The 
higher the critical score, however, the more satisfactory the successful 
personnel would tend to be. It depends upon whether we are interested 
in numbers of successful individuals or in average quality. There are 
administrative questions of balance, also. It might be disadvantageous 
to take on at one time a whole class of prima donnas! 

Some numerical examples may be given to illustrate the points just
made concerning favorable success and selection ratios. Let us assume a
validity coefficient of .60, a typical value for good selection batteries.
Let us also assume normal distributions in both test and criterion. If the
success ratio $S_0$ is .95, by rejecting 40 per cent of the applicants we could
achieve a success ratio $S_1$ of .99. This is an improvement of only about
4 per cent over the results without the tests. Compare this with the
index of forecasting efficiency, which is 20 per cent when r = .60. To
bring the $S_1$ up to 1.00, approximately, we would need to reject at least
60 per cent of the applicants. In either case, we reject about 10 applicants
to gain one more successful individual. Rejections beyond 60 per
cent would gain us practically nothing.

Let the success ratio $S_0$ be .05, and what is the result? A rejection of
55 per cent of the applicants would net an increase of .05 in the success
ratio, a gain of 100 per cent. By rejecting as many as 95 per cent the
$S_1$ could be raised to .30. This is a gain of 500 per cent. Compare these
percentage gains with the index of forecasting efficiency of 20.

To take less extreme instances of $S_0$, let us assume ratios of .80 and .20,
with r still equal to .60. With the high $S_0$ of .80, we need to reject about
60 per cent in order to raise $S_1$ to .95, a gain of 17.5 per cent. With the
low $S_0$ of .20, the rejection of 60 per cent yields a success ratio of .38,
a gain of 90 per cent.
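Values of this kind can be approximated without tables. The Python sketch below is not Taylor and Russell's own procedure; it assumes, as the examples do, normal distributions in test and criterion with linear regression, and it integrates numerically over the selected part of the test scale. The function name, the step count, and the upper limit of integration are our choices.

```python
from statistics import NormalDist
import math

STD = NormalDist()  # unit normal distribution

def success_ratio_with_test(s0, selection_ratio, r, steps=4000):
    """Approximate S1 for a given S0, selection ratio, and validity r.

    Assumes a bivariate normal surface; s0 and selection_ratio must lie
    strictly between 0 and 1, and r strictly between -1 and 1.
    """
    y_c = STD.inv_cdf(1.0 - s0)               # critical criterion value
    x_c = STD.inv_cdf(1.0 - selection_ratio)  # critical test score
    spread = math.sqrt(1.0 - r * r)           # SD of Y about the regression line
    upper = 8.0                               # practical stand-in for infinity
    h = (upper - x_c) / steps

    def integrand(x):
        # Density at x times the chance of success for a person scoring x.
        p_success = 1.0 - STD.cdf((y_c - r * x) / spread)
        return STD.pdf(x) * p_success

    # Trapezoidal rule over the selected region of the test scale.
    total = 0.5 * (integrand(x_c) + integrand(upper))
    for i in range(1, steps):
        total += integrand(x_c + i * h)
    return total * h / selection_ratio

# Rejecting 40 per cent when S0 = .95, and 60 per cent when S0 = .20.
print(round(success_ratio_with_test(0.95, 0.60, 0.60), 3))
print(round(success_ratio_with_test(0.20, 0.40, 0.60), 3))
```

The two calls correspond to the examples with $S_0$ of .95 and .20 given above, and the results should fall close to the .99 and .38 quoted there.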

A Graphic Chart of Relations of $S_1$ to Selection Ratio.—Figure 15.11
shows, for the situation when the validity coefficient is .60, the change in
success ratio $S_1$ as the selection ratio changes. Each curve represents a
different initial or basic success ratio, $S_0$. Taylor and Russell provide
tables which record these same relationships for various validity
coefficients and Guilford and Michael provide charts similar to Fig. 15.11
for other validity levels.¹


[Figure 15.11. Curves for initial success ratios $S_0$ from 0.05 to 0.95, plotted
against selection ratio or the proportion selected, 0 to 1.00.]

Fig. 15.11.—Chart relating success ratio to selection ratio when the validity coefficient
is .60.


Indices-of-Improvement Methods.—In the Taylor-Russell method of
test evaluation our attention is concentrated upon numbers and percentages
of successful individuals. We ask what is the percentage increase in the
numbers of satisfactory personnel, without specifying anything about the
degree of satisfaction. Much depends upon the placing of a passing point
on the criterion scale and an ignoring of the fact that success is st[...]
variable. In terms of planning in selection and training [...]ations,
where numbers of recruits may be liberal
and standards of passable satisfaction are readily established, this kind
of evaluation of a selection instrument or program is adequate and well
adapted. There are other procedures, however, that concentrate more
upon the fact of graded excellence in criterion measures, and which
involve thinking in terms of work output of personnel. The worth of a
selection program is established if we can demonstrate a certain percentage
increase in production of some kind. If the criterion is measured
in terms of absolute amounts of production of workers, we may ask,
“What percentage improvement in production does test selection bring
about?” The answer can then be balanced against the cost of the testing
program.

1 Taylor and Russell, op. cit.; Guilford, J. P., and Michael, W. B. Prediction of
categories from measurements. Beverly Hills, Calif.: Sheridan Supply Co.

The Jarrett Method.—Although the first suggestion for this kind of
index of test evaluation was made by Richardson,¹ a more useful procedure
was developed by Jarrett.² With somewhat different symbols
than those used by Jarrett, his index of improvement can be computed
by the formula


$$I = r_{yx} v_y \left( \frac{M_s - M_x}{\sigma_x} \right)$$ (Percentage improvement in output for a selected group) (15.28)

where $r_{yx}$ = validity coefficient for the test.
      $v_y$ = index of variability of criterion scores given by the equation³

$$v_y = \frac{\sigma_y}{M_y}$$ (Relative variability of measurements) (15.29)

where $M_s$ = mean of test scores for the selected personnel.
      $M_x$ = mean of test scores for all applicants.
      $M_y$ = mean of criterion measurements.
      $\sigma_y$ = standard deviation of the criterion measures.
If we may assume that the criterion measures are normally distributed,
the last term in formula (15.28) is equivalent to the ratio $y/p_s$ and we have

$$I = r_{yx} v_y \frac{y}{p_s}$$ (Percentage improvement in output for a selected group for a normally distributed criterion) (15.30)


1 Richardson, M. W. The interpretation of a test validity coefficient in terms of
increased efficiency of a selected group of personnel. Psychom., 1944, 9, 245-248.

2 Jarrett, R. F. Percent increase in output of selected personnel as an index of test
efficiency. J. appl. Psychol., 1948, 32, 135-145.

3 The statistic $v_y$ will be recognized as 1/100 of the coefficient of variation given in
Ch. 5. Here, as well as there, the measurements must be in terms of a scale with an
absolute zero point. Piecework scores, dollar values of output, and the like, qualify
for the use of this statistic. Ratings would not qualify.




where $p_s$ = proportion of applicants selected.
      y = ordinate in unit normal distribution curve at point marking
          off p proportion of cases.

An inspection of formula (15.30) leads to some interesting inferences
which agree with things already pointed out. With $v_y$ and y/p constant,
I is entirely dependent upon the validity of the test and directly
proportional to it. With $r_{yx}$ constant, I increases as $v_y$ increases. That is,
the more variable the criterion measures with respect to their mean, the
greater is the improvement resulting from selection. It is reasonable
that if all workers performed equally well there would be little use to
attempt to discriminate among them by means of tests. The better they
can be discriminated in terms of individual output, the better the chance
there is of differentiating among them by means of predictive instruments.
The factor y/p, as will be seen in Table G, is larger as p approaches .00
and smaller as p approaches 1.0. When p = .01 this ratio is about 100
times as large as when p = .99. This principle agrees with the one applying
to the Taylor-Russell method: that the lower the selection ratio, the
greater the benefit from selection.
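The behavior of the factor y/p can be examined with a few lines of Python. This is a sketch under stated assumptions: the function names are ours, the index is computed as the plain product $r_{yx} v_y (y/p)$ as we read formula (15.30), and the values of r, v, and p in the last line are hypothetical.

```python
from statistics import NormalDist

STD = NormalDist()  # unit normal distribution

def ordinate_ratio(p):
    """y/p, where y is the unit-normal ordinate at the point that leaves
    the proportion p of cases above it (0 < p < 1)."""
    z = STD.inv_cdf(1.0 - p)
    return STD.pdf(z) / p

def improvement_index(r_yx, v_y, p):
    """Index of improvement as the product r * v * (y/p); whether a further
    factor of 100 is intended is not settled here."""
    return r_yx * v_y * ordinate_ratio(p)

# Compare the factor y/p at a very low and a very high selection ratio.
low, high = ordinate_ratio(0.01), ordinate_ratio(0.99)
print(f"y/p at p = .01: {low:.4f}   at p = .99: {high:.4f}   ratio: {low / high:.1f}")

# Hypothetical validity, variability, and selection ratio.
print(f"I = {improvement_index(0.60, 0.20, 0.30):.4f}")
```

The first printed line bears out the statement that the ratio at p = .01 is about 100 times the ratio at p = .99.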

Evaluation in Terms of Cost and Utility.—Berkson¹ has recently developed
a procedure which emphasizes a comparison of the utility of a test
with its cost. Utility is defined as the p[...] that would be eliminated by
the test. [...] graduates the test would eliminate. [...] to Fig. 15.9.
Utility would equal 100D/(C + D). [...] 100B/(A + B). The indices are,
of [...]e assumed [...]tion that throughout the range, the higher [...]age
criterion performance of the individu[...]

1 Berkson, J. Cost-utility as a measure of the efficiency of a test. J. Amer. stat.
Assoc., 1947, 42, 246-255.


the highly intelligent person, but for predictive purposes we do not par- 
ticularly need to know the reasons. The fact of curved regression is 
undeniable and should be recognized in selection. It is likely that when 
the whole range of intelligence is studied in relation to job proficiency of 
many kinds, there will be found an optimal intelligence level for each 
kind of job. Curved regressions are often overlooked because the investi- 
gator fails to plot scatter diagrams, or because he has a restricted range in 
his population. In application for jobs, there is often enough self-selection 


[Figure 15.12. Vertical axis: proficiency in a job assignment. Horizontal axis:
score (IQ equivalent) on an intelligence test, 75 to 150, with two critical
scores marked.]

Fig. 15.12.—A curved regression of a job-proficiency criterion variable on the test-score
variable X, showing that a high cutoff score may be needed in addition to a low one.


beforehand so that a limited range appears for examination. The result- 
ing regression is therefore often linear within that range and some corre- 
lations are zero because in that range there is no upward trend in Y as 
X increases. In relating certain temperament-test scores to rated pro- 
ficiency of administrators, for example, the writer has found a few unde- 
niable signs of curvature, with the optimal score not at the top. Relations 
of other temperament scores to job proficiency measures in such routine 
tasks as cigar wrapping and stocking pairing reveal optimal scores below
the average, that is, toward the extreme ordinarily denoted as poor
personality traits.

Wherever curvature such as that shown in Fig. 15.12 is indicated by
the data, two critical scores may be called for. If a cutoff score were


placed at $X_c$, then all the personnel above that point are apparently about
equally good in terms of job proficiency. If the cutoff point were moved
up to $X_{c1}$, however, there are individuals having scores at the upper end
who are just as poor performers on the job as many below $X_c$. A second
critical point at $X_{c2}$ would eliminate the high-scoring but below-optimal
performers. If selection were further restricted, it should be restricted 
from both directions. 

The problems of evaluating selection devices when regressions are non- 
linear are more complex than those we have already seen. None has 
been worked out for this kind of situation, but variations of methods 


already described would serve. The fundamental principles would be 
the same. 


Exercises 


1. Using the data of Table 14.10, predict the most probable score in the personality
inventory for alcoholics and nonalcoholics, and for the two combined. What is the
margin of error of prediction as made in these three ways?

2. Compute a standard error of estimate for the prediction problem in Exercise 1.
What does it tell us?

3. What is the most probable total score for the passing and failing students
represented in Table 13.4? What is the accuracy of prediction for each category? How
much improvement from knowledge of category?

4. For Data 15A, find the best prediction of score in the Opposites test corresponding
to each midpoint score in the Mixed-sentences test. Estimate the margin of error
for each prediction and for the predictions taken as a whole.

5. Find the two regression equations for Data 15A. Make all possible checks as
to internal consistency of your computations.


Data 15A.—A SCATTER DIAGRAM FOR TWO MENTAL TESTS

Y (Opposites test          X (Mixed-sentences test in Army Alpha)
in Army Alpha)     0-2 | 3-5 | 6-8 | 9-11 | 12-14 | 15-17 | 18-20 | 21-23 | f_y
36-38 1 1 
33-35 1 2 3 
30-32 1 1 3 7 2 14 
27-29 4 5 2 11 
24-26 1 3 3 2 4 4 17 
21-23 1 6 1 5 2 15 
18-20 1 2 1 9 5 4 22 
15-17 1 2 2 2 2 1 12 
12-14 1 2 0 2 2 1 8 
9-11 1 2 1 5 9
6-8 1 1
f_x [...] 25 18 27 13 | 113




6. Using the appropriate regression equation, make a prediction of score in the 
Opposites test corresponding to each midpoint score in the Mixed-sentences test. 
Compare these predictions with those obtained in Exercise 4. 

7. Compute the two standard errors of estimate for Data 15A. What are the 
amounts of predicted and of nonpredicted variance in Y? What are the proportions of 
these two kinds of variances here? 

8. Draw a diagram like Fig. 15.6 that applies to Data 15A. Draw another diagram
like Fig. 15.4 showing the two regression lines.

9. Derive the statistics k, E, and $r^2$ for Data 15A. Interpret these findings.

10. Using formula (15.15), compute a regression equation for the first 10 pairs of
scores for Parts V and VI in Data 8A.


CHAPTER 16 
MULTIPLE PREDICTION 


MULTIPLE CORRELATION 


Independent and Dependent Variables.—Thus far we have been deal- 
ing with correlations between two things at a time and the prediction of 
some variable Y from another variable X, or vice versa. Actual relationships
between measured things in psychology and education are by no
means so simple as that. One variable is found associated with, or depend- 
ent upon, more than one other variable at the same time. When we can 
think of some variables as being causes of another one, or even when we 
merely want to predict that one from our knowledge of several others 
that are correlated with it, we call the one variable the dependent variable 
and the ones upon which it depends the independent variables. The inde- 
pendent variables are so called because we can manipulate them at will or 
because they vary by the nature of things, and in consequence, we expect 
the dependent variable to vary accordingly. 

Whether or not a certain color is liked depends upon several factors:
its hue (whether yellow, red, or purple, etc.), its lightness (whether light,
medium, or dark), and its chroma (saturation or density). The affective
value of the color also depends upon its area, its use, and its background.
We are here naming independent variables upon which the affective value
of a color depends. In so far [...] of color, it will exhibi[...]

[...] represents the dependent variable. [...] percentage of graduates—not
an ordin[...] nevertheless, show the principles in[...] variables are shown
as sides of the base, [...]


scale of chronological age is shown reversed for convenience, since the 
correlation between age and the training criterion was negative. Both 
independent variables are shown here in very coarse categories for the 
sake of a simpler diagram. 

By noting rows of blocks (left to right) we can see how graduation rate 
changes with age for a relatively constant level of aptitude. By noting 
the columns of blocks (front to back) we can see how graduation rate 
changes with aptitude score for a relatively constant age level. The term 


Fig. 16.1.—A multiple regression, with percentage graduating from pilot training as a
function of both chronological age and aptitude score. (Adapted from an unpublished
report of Headquarters, AAF Training Command, Fort Worth, Texas.)


constant covers an unusual range in this illustration, but with finer grouping
on age and aptitude we would expect similar trends. It is obvious
that the regressions for the criterion on aptitude are much steeper than
those for the criterion on age. The difference would be even more apparent
if we had the criterion in terms of a properly graded measurement
scale. The correlation between aptitude scores and the criterion was
much higher (approximately .55) than that between age and the criterion
(approximately -.10). A very rough appreciation of the joint predictive
value of aptitude score and age can be seen by noting the change of height
from the lowest block (29.9 per cent) to the highest (90.0 per cent). This
change may be compared with those changes across columns alone or
across rows alone. From this comparison we should expect better prediction
from both independent variables than from either alone.

The Coefficient of Multiple Correlation.—When we are interested in 
the amount of correlation between a dependent variable and two or more 
others simultaneously, we are dealing with a multiple-correlation problem. 
The multiple coefficient of correlation indicates the strength of relation- 
ship between one variable and two or more others taken together. The 
multiple correlation is not merely the sum of the correlations of the depend- 
ent variable and the various independent variables taken separately. 
(Obviously, there would be instances in which these would add up to
more than 1.00.) One reason is that independent variables themselves
are usually overlapping (intercorrelated) and so duplicate one another to
some extent. In this we see one important principle of multiple
correlation. The multiple R is related to the intercorrelation of independent
variables as well as to their correlation with the dependent variable. The
interdependency of the determiners suggested for affective value of colors
is probably not so apparent as in the case of factors related to
achievement in college algebra. Here we can think of such predictive factors as
intelligence-test scores and high-school marks, which being related
duplicate one another to some extent in predicting achievement in college
algebra. Hours of study and interest also bear much in common and
so are not completely independent determiners of success in algebra.

A Multiple-correlation Problem.—In Table 16.1 are presented some
data that call for the multiple-correlation solution. Four of the variables
($X_2$, $X_3$, $X_4$, and $X_5$) are all measures of things that supposedly
determine academic success in college freshmen. $X_1$ is the dependent
variable, or average freshman marks. It is customary to designate the
dependent variable by $X_1$, though some authors, less often, call it $X_0$.
An examination of Table 16.1 shows that the analogies test and high-school
average mark have the highest correlation, when taken alone, with $X_1$,
whereas the interest score $X_5$ has the lowest. The highest
intercorrelations come between $X_2$, $X_3$, and $X_4$. All represent
abilities of one kind or another, and their correlations with $X_5$
(interests) are generally lower. This gives promise that the interest
scores will contribute something to the prediction of college marks that
will not have been already contributed by the other variables, and so it
should pay to include $X_5$ in the battery of predictive indices. As a
matter of experience in psychological and educational predictions, it has
been a common finding that it rarely pays to bring into a
multiple-prediction situation more than four or five independent variables.




TABLE 16.1.—INTERCORRELATIONS AMONG FIVE VARIABLES, INCLUDING ONE INDEX
OF SCHOLARSHIP AND FOUR PREDICTIVE INDICES (N = 174)*

Variable    $X_2$    $X_3$    $X_4$    $X_5$    $X_1$
$X_2$         —      .562     .401     .197     .465
$X_3$        .562     —       .396     .215     .583
$X_4$        .401    .396      —       .345     .546
$X_5$        .197    .215     .345      —       .365
$X_1$        .465    .583     .546     .365      —
Mean         19.7    49.5     61.1     29.7     73.8
$\sigma$      5.2    17.0     19.4      3.7      9.1

$X_2$ = arithmetic test in the Ohio State Psychological Examination, Form 10.
$X_3$ = analogies test in the same examination.
$X_4$ = an average grade in high-school work.
$X_5$ = student interest inquiry (measuring breadth of interest).
$X_1$ = an average grade for the first semester in university.

* These data were abstracted from the Ohio State Coll. Bull. 58, by L. D.
Hartson, and have been used in this chapter by permission.

By the time that this many are combined, they have fairly well covered 
what any additional one can do for us. This is partly a consequence of 
the fact that good human qualities tend to go together (to be intercorre- 
lated) and partly that our predictive indices tend to remain in the same 
area of abilities, also ignoring personality factors, physical factors, and 
external circumstances.

The Solution of a Three-variable Problem.—We first take the simplest
case of multiple correlation, that between the dependent variable and two
independent variables. In the general problem given by the data in
Table 16.1, we may ask what is the correlation between freshman marks
on the one hand and the two variables analogies-test scores and
high-school averages on the other. The simplest general formula for this
case is

$$R^2_{1.23} = \frac{r_{12}^2 + r_{13}^2 - 2 r_{12} r_{13} r_{23}}{1 - r_{23}^2}$$   (Square of coefficient of multiple correlation with three variables)   (16.1)

where $R_{1.23}$ = coefficient of multiple correlation between $X_1$ and a
combination of $X_2$ and $X_3$.

Be sure to notice that this formula merely gives us $R^2$, the square root of
which is R.

The immediate example we have set for ourselves is to find $R_{1.34}$ rather
than $R_{1.23}$. To use formula (16.1), we need merely to substitute the
subscripts 3 and 4 for 2 and 3. The solution is

$$R^2_{1.34} = \frac{(.583)^2 + (.546)^2 - 2(.583)(.546)(.396)}{1 - (.396)^2}$$
$$= \frac{.339889 + .298116 - .252108}{1 - .156816}$$
$$= .45766$$
$$R_{1.34} = .677$$
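
The arithmetic of formula (16.1) can be checked with a few lines of code. The following is only a minimal sketch: the function name `multiple_r_squared` is an illustrative choice, and the three correlations are the Table 16.1 values used in the solution above.

```python
import math


def multiple_r_squared(r12, r13, r23):
    """Squared multiple correlation of one dependent variable with two
    independent variables, following formula (16.1)."""
    return (r12**2 + r13**2 - 2 * r12 * r13 * r23) / (1 - r23**2)


# Correlations taken from Table 16.1: r13 = .583, r14 = .546, r34 = .396.
r_sq = multiple_r_squared(0.583, 0.546, 0.396)
r = math.sqrt(r_sq)  # formula (16.1) gives R squared, so take the root
print(f"R squared = {r_sq:.4f}, R = {r:.3f}")
```

The result agrees with the hand computation to about three decimal places.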


The Multiple-regression Equation.—We also have here a prediction
problem of estimating $X_1$ values from both $X_3$ and $X_4$. This calls for a
regression equation that involves all three variables, in other words, a
multiple-regression equation. From such an equation, we can predict
an $X_1$ value for every individual. The correlation between these
predicted values ($X'_1$) and the obtained ones ($X_1$) would be .677. This is
another interpretation of a multiple coefficient of correlation. For the
three-variable problem, the regression equation has the general form
$X'_1 = a + b_{12.3}X_2 + b_{13.2}X_3$. As in previous regression equations, the
coefficient a is a constant and must be calculated from the data. Its
function is to assure that the mean of the $X'_1$ values coincides with the
mean of the $X_1$ values. The b coefficients serve the same purpose here
as in the simple, two-variable equation. The coefficient $b_{12.3}$ is the
multiplying constant or weight for the $X_2$ values, and $b_{13.2}$ is the
weight for the $X_3$ values. The value of $b_{12.3}$ tells how many units $X'_1$
increases for every unit increase in $X_2$, when the effects of $X_3$ have
been nullified or held constant. The value of $b_{13.2}$ tells how many units
$X'_1$ increases for every unit increase in $X_3$, with the effects of $X_2$
removed from consideration. The particular b weights, as computed by the
formulas given below, are the optimal weights. They assure the maximum
correlation between predicted and obtained values. The solution, with the
obtained b weights, satisfies the principle of least squares in that the
sum of the squares of discrepancies between the $X_1$ values and the $X'_1$
values will be a minimum.

Solution of the b Coefficients.—We do not find the b coefficients directly
from the correlations but do so indirectly through the so-called beta
coefficients. Beta coefficients are called standard partial regression
coefficients: standard, because they would apply if standard measures were
used in all variables; partial, because, as in the case of the coefficient of
partial correlation (see Ch. 13), the effects of other variables are held
constant. The $b_{12.3}$ and $b_{13.2}$ are known as partial regression
coefficients, because they, too, are weights that presuppose that other
independent variables are held constant. They are given by the formulas


$$b_{12.3} = \left(\frac{\sigma_1}{\sigma_2}\right)\beta_{12.3}$$   (16.2a)
and
$$b_{13.2} = \left(\frac{\sigma_1}{\sigma_3}\right)\beta_{13.2}$$   (16.2b)
(Partial regression coefficients)

The betas, in turn, are found by the formulas

$$\beta_{12.3} = \frac{r_{12} - r_{13} r_{23}}{1 - r_{23}^2}$$   (16.3a)
and
$$\beta_{13.2} = \frac{r_{13} - r_{12} r_{23}}{1 - r_{23}^2}$$   (16.3b)
(Standard partial regression coefficients)

Similar equations apply, with change of subscripts, when the independent
variables are $X_3$ and $X_4$ instead of $X_2$ and $X_3$. In our example

$$\beta_{13.4} = \frac{.583 - (.546)(.396)}{1 - (.396)^2} = \frac{.3668}{.8432} = .435$$
and
$$\beta_{14.3} = \frac{.546 - (.583)(.396)}{1 - (.396)^2} = \frac{.3151}{.8432} = .374$$

We can now solve for the b coefficients by means of formulas (16.2ab):

$$b_{13.4} = \frac{9.1}{17.0}(.435) = .233$$
and
$$b_{14.3} = \frac{9.1}{19.4}(.374) = .175$$
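
Formulas (16.3ab) and (16.2ab) translate directly into code. This is a hedged sketch rather than a definitive routine: the helper names `beta_weights` and `b_weight` are hypothetical, and the correlations and standard deviations are those of Table 16.1.

```python
def beta_weights(r1a, r1b, rab):
    """Standard partial regression coefficients for two predictors a and b,
    following formulas (16.3a) and (16.3b)."""
    denominator = 1 - rab**2
    beta_a = (r1a - r1b * rab) / denominator
    beta_b = (r1b - r1a * rab) / denominator
    return beta_a, beta_b


def b_weight(beta, sd_dependent, sd_predictor):
    """Raw-score regression weight from a beta, following formulas (16.2ab)."""
    return (sd_dependent / sd_predictor) * beta


# Table 16.1: r13 = .583, r14 = .546, r34 = .396; sigma_1 = 9.1,
# sigma_3 = 17.0, sigma_4 = 19.4.
beta_13_4, beta_14_3 = beta_weights(0.583, 0.546, 0.396)
b_13_4 = b_weight(beta_13_4, 9.1, 17.0)
b_14_3 = b_weight(beta_14_3, 9.1, 19.4)
print(round(beta_13_4, 3), round(beta_14_3, 3))
print(round(b_13_4, 3), round(b_14_3, 3))
```

Carrying full precision through the betas gives weights that match the hand-computed .233 and .175 to three decimals.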


For the complete regression equation, the a coefficient is still lacking.
It is given by the general formula

$$a = M_1 - b_{12.3}M_2 - b_{13.2}M_3$$   (16.4)

Inserting the known values

$$a = 73.8 - (.233)(49.5) - (.175)(61.1)$$
$$= 73.8 - 11.53 - 10.69$$
$$= 51.58$$


The complete regression equation will then read

$$X'_1 = 51.58 + .233X_3 + .175X_4$$

To interpret the equation, we may say that for every unit increase in
$X_3$, $X_1$ is increasing .233 unit and that for every unit increase in $X_4$,
$X_1$ is increasing .175 unit. To apply the equation to a particular student
whose $X_3$ score is 25 and whose $X_4$ score is 32, we predict that his $X_1$
score will be

$$X'_1 = 51.58 + 5.82 + 5.60 = 63.00$$

We use $X'_1$ to stand for his predicted average freshman mark, because he
has an actual average mark that we call $X_1$. Some other examples of
individual students are presented in Table 16.2 to show how various
combinations of values for $X_3$ and $X_4$ point to corresponding values of
$X_1$.

TABLE 16.2.—SOME PREDICTIONS OF SCHOLARSHIP MARK FROM MEASURES IN TWO
VARIABLES

                                        Student
                                A       B       C       D       E
$X_3$ analogies score          25      27      48      85      87
$X_4$ high-school average      32      61      65      90      52
$b_{13.4}X_3$                5.82    6.29   11.18   19.80   20.27
$b_{14.3}X_4$                5.60   10.68   11.38   15.75    9.10
$X'_1$ (predicted)           63.0    68.6    74.1    87.1    81.0
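
The predictions in Table 16.2 can be reproduced by applying the regression equation to each student. The sketch below is illustrative only: `predict_mark` and the `students` dictionary are hypothetical names, and because the table rounds each product before summing, an unrounded computation can differ from the printed value in the last digit.

```python
# Constants of the regression equation X'1 = 51.58 + .233 X3 + .175 X4.
A = 51.58
B_ANALOGIES = 0.233
B_HIGH_SCHOOL = 0.175


def predict_mark(x3, x4):
    """Predicted freshman mark from analogies score x3 and
    high-school average x4."""
    return A + B_ANALOGIES * x3 + B_HIGH_SCHOOL * x4


# (X3, X4) pairs for the five students of Table 16.2.
students = {"A": (25, 32), "B": (27, 61), "C": (48, 65),
            "D": (85, 90), "E": (87, 52)}
predictions = {name: predict_mark(x3, x4)
               for name, (x3, x4) in students.items()}
for name, value in predictions.items():
    print(name, round(value, 2))
```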
making predictions of scores in Xı 


_ Diagonal the figure, each 
representing the locus of identical predicted values. These lines
represent $X'_1$ scores at intervals of 5 units. Note, for example, the line
for $X'_1 = 70$. A prediction of 70 may arise from many different
combinations of $X_3$ and $X_4$. Choose several values, in turn, in the
analogies test, for example, 10, 30, 50, and 70. Corresponding values in
high-school average needed to yield predictions of 70 are 92, 65, 38, and 12,
respectively. The chief use of the chart, however, is to find $X'_1$ for two
given values in $X_3$ and $X_4$. For an $X_3$ of 20 and an $X_4$ of 50, the
prediction is exactly 65. For an $X_3$ of 90 and an $X_4$ of 14, the prediction
is exactly 75. When the prediction is not exactly on one of the diagonal
lines, we interpolate, by inspection, between two lines. Thus, for
$X_3 = 40$ and $X_4 = 70$, the most probable $X_1$ is 73. The proportion of the
distance between two diagonal lines must be estimated by the perpendicular
distance between them. The perpendicular is in a diagonal direction. The
reader may get further practice in using the chart by verifying the
predictions found by computation in Table 16.2.

Fig. 16.2.—Diagram showing constant values in the dependent variable for
different combinations of scores in two independent variables, each
weighted as called for by the multiple-regression equation. [Horizontal
axis: score in $X_3$ (analogies test); vertical axis: $X_4$ (high-school
average).]

Calculating the Multiple R from Beta Coefficients.—If the beta
coefficients are known, the shortest route to the multiple R is by way of the
equation

$$R^2_{1.23} = \beta_{12.3} r_{12} + \beta_{13.2} r_{13}$$   (16.5)

Again, note that this gives $R^2$, from which the square root must be obtained.
For the scholarship data and variables $X_3$ and $X_4$,

$$R^2_{1.34} = (.435)(.583) + (.374)(.546)$$
$$= .253605 + .204204$$
$$= .457809$$
$$R_{1.34} = .677$$

as was found by formula (16.1) previously.
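
Formula (16.5) is a sum of beta-times-validity products, which the following sketch makes explicit. The function name `r_squared_from_betas` is an illustrative assumption; the betas and correlations are the rounded values from the worked example.

```python
import math


def r_squared_from_betas(betas, validities):
    """Squared multiple R as the sum of beta * r products, formula (16.5)."""
    return sum(beta * r for beta, r in zip(betas, validities))


# Rounded betas (.435, .374) paired with r13 = .583 and r14 = .546.
components = [0.435 * 0.583, 0.374 * 0.546]
r_sq = r_squared_from_betas([0.435, 0.374], [0.583, 0.546])
print([round(c, 6) for c in components], round(r_sq, 6),
      round(math.sqrt(r_sq), 3))
```

The two entries of `components` are the separate terms of formula (16.5) for the analogies test and the high-school average.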

Interpretation of a Multiple R.—Once computed, a multiple R is subject
to the same kinds of interpretation, as to size and importance, as were
described for a simple r. One kind of interpretation is in terms of $R^2$,
which we call the coefficient of multiple determination. This tells us the
proportion of variance in $X_1$ that is dependent upon, or associated with,
or predicted by $X_3$ and $X_4$ combined with the regression weights used. In
this case, $R^2$ is .4578, and we can say that 45.78 per cent of the variance
in freshman marks is accounted for by whatever is measured by the
analogies test and by high-school marks taken together, eliminating from
double consideration things that they have in common. The remaining
percentage of the variance, which is 54.22 ($1 - R^2$), is still to be
accounted for. This remainder is given the symbol $K^2$ and is known as the
coefficient of multiple nondetermination. This is consistent with the fact
that $R^2 + K^2 = 1.0$, just as $r^2 + k^2 = 1.0$ in the simple correlation
problem.

Relative Contribution of Independent Variables.—Since the coefficient
of multiple determination, or $R^2$, is composed of the two components in
formula (16.5) and since each component pertains only to one of the
independent variables, it is permissible to take each component as
indicating the contribution of one independent variable to the total
predicted variance of $X_1$. This being the case, the first term, .253605,
indicates the contribution made by ability in the analogies test, and the
second term, .204204, indicates the contribution of the high-school
average. In terms of percentages, these are 25.4 and 20.4. This enables us
to obtain a more definite idea of the relative importance of each variable
in the regression equation. We can say that the analogies test, with what
it has in common with high-school scholarship held constant, contributes
about 25 per cent to freshman scholarship and that high-school marks, apart
from that portion related to analogies-test ability, contribute about 20
per cent. We cannot take these as final or absolute, for there are other
factors contributing to freshman scholarship level that have not been
similarly eliminated from consideration. But it is of much value to be able
to compare contributions of variables to outcomes in this manner.



The Standard Error of Estimate from Multiple Predictions.—The
standard error of estimate is again brought in to indicate about how far
the predicted values would deviate from the obtained ones. The formula
is the same as previously, except that the multiple R is substituted for r.
It now reads

$$\sigma_{1.23} = \sigma_1\sqrt{1 - R^2_{1.23}}$$   (Standard error of multiple estimate)   (16.6)

In the illustrative problem,

$$\sigma_{1.34} = 9.1\sqrt{1 - .457809}$$
$$= 9.1 \times .736$$
$$= 6.7$$



We can now say that two-thirds of the obtained $X_1$ values will lie within
6.7 points of the predicted $X_1$ values. The margin of error with knowledge
of $X_3$ and $X_4$ is 73.6 per cent as great as the margin of error would be
without that knowledge. These conclusions presuppose predictions made
on the basis of the regression equation that was obtained, and predictions
made for individuals belonging to the population and sampled at random.

The index of forecasting efficiency may also be used by way of
interpretation and because of its close relation to the standard error of
estimate may be mentioned at this point. The formula is the same as for a
Pearson r [see formula (15.22)]. In the example of our three variables,
E = 26.4 per cent, which means that predictions by means of the equation
are 26.4 per cent better than those made merely from a knowledge of the
mean of the $X_1$ values.
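
Both quantities follow from the same factor, $\sqrt{1 - R^2}$. The sketch below assumes that formula (15.22), which is not reproduced in this section, has the usual form E = 100(1 − √(1 − r²)); that assumption is at least consistent with the 73.6 and 26.4 per cent figures quoted above. The function names are illustrative.

```python
import math


def standard_error_of_estimate(sd_dependent, r_squared):
    """Standard error of multiple estimate, formula (16.6)."""
    return sd_dependent * math.sqrt(1 - r_squared)


def forecasting_efficiency(r_squared):
    """Index of forecasting efficiency in per cent.
    ASSUMPTION: E = 100 * (1 - sqrt(1 - r^2)), taken to be the form of
    formula (15.22), which this section only cites."""
    return 100 * (1 - math.sqrt(1 - r_squared))


se_estimate = standard_error_of_estimate(9.1, 0.457809)
efficiency = forecasting_efficiency(0.457809)
print(round(se_estimate, 1), round(efficiency, 1))
```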

Multiple Correlation in Small Samples.—For small samples (and for
multiple-correlation problems this means anything less than an N of 100),
degrees of freedom should be considered in dealing with questions of
sampling. If the multiple R and the other statistics derived from it are to
be used for estimating population parameters, there is even more bias than
for a simple correlation problem. It was stated earlier that the multiple
R represents the maximum correlation between a dependent variable and
a weighted combination of independent variables. The least-square
solution that is represented in computing the combined weights assures this
result; but it really assures too much. It capitalizes upon any chance
deviations that favor high multiple correlation. The multiple R is
therefore an inflated value. It is a biased estimate of the multiple
correlation in the population. If we were to apply the regression weights
in a new sample and to correlate predicted $X_1$ values with obtained $X_1$
values, we would probably find that the correlation would be smaller than R.
It is desirable, therefore, to find some means of estimating a parameter R
which gives a more realistic picture of the general situation. A common
way of "shrinking" R to a more probable population value is by the
formula

$${}_cR^2 = 1 - (1 - R^2)\,\frac{N - 1}{N - m}$$   (Correction in R for bias)   (16.7)

where N = the number of cases in the sample correlated.
      m = the number of variables correlated.
  N − m = the number of degrees of freedom, one degree being lost for
          each mean, there being one mean per variable.

For the illustrative problem above, where R = .677, the corrected $R^2$
would be

$${}_cR^2 = 1 - (1 - .4579)\,\frac{173}{171}$$
$$= 1 - (.5422)(1.0117)$$
$$= .4515$$

$${}_c\sigma_{1.23 \ldots m} = \sigma_{1.23 \ldots m}\sqrt{\frac{N - 1}{N - m}} = \sigma_1\sqrt{(1 - R^2)\,\frac{N - 1}{N - m}}$$   (General correction of a multiple standard error of estimate for bias)   (16.8)
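
The shrinkage correction is a one-line computation. In this minimal sketch the function names are hypothetical, N = 174 and m = 3 come from the illustrative problem, and the corrected standard error is written in the form of (16.8) as reconstructed here.

```python
import math


def corrected_r_squared(r_squared, n_cases, n_variables):
    """Shrunken R squared, formula (16.7)."""
    return 1 - (1 - r_squared) * (n_cases - 1) / (n_cases - n_variables)


def corrected_standard_error(sd_dependent, r_squared, n_cases, n_variables):
    """Standard error of estimate corrected for bias, formula (16.8)."""
    ratio = (n_cases - 1) / (n_cases - n_variables)
    return sd_dependent * math.sqrt((1 - r_squared) * ratio)


shrunken = corrected_r_squared(0.457809, 174, 3)
corrected_se = corrected_standard_error(9.1, 0.457809, 174, 3)
print(round(shrunken, 4), round(corrected_se, 2))
```

With 174 cases the correction is slight; it grows quickly as N approaches the number of variables.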


Standard Error of R.—For an R derived from any number of variables,
the standard error is

$$\sigma_R = \frac{1 - R^2}{\sqrt{N - m}}$$   (Standard error of a multiple R)   (16.9)

The result is interpreted as for the standard error of any r.

When the null hypothesis is to be tested, Table D is most convenient.
The R's meeting the 5 per cent and 1 per cent levels of significance are
shown in columns headed by numbers of variables and rows headed by
appropriate numbers of df. In the illustrative problem, N = 174, so the
number of degrees of freedom is 171. The standard error of R is .041.
The obtained R cannot very well be more than .11 from the population
value of R (.11 being about 2.58 times $\sigma_R$). From Table D we find that
with 150 degrees of freedom (the next lower and nearest to 171) and with
3 variables, an R of .198 is significant at the 5 per cent level and one of .244
at the 1 per cent level. We should have little room for doubt that a genuine
multiple correlation exists in the population.
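
A short check of formula (16.9) for the illustrative problem follows. The names are illustrative; the 2.58 multiplier is the one quoted in the text.

```python
import math


def standard_error_of_r(r_squared, n_cases, n_variables):
    """Standard error of a multiple R, formula (16.9)."""
    return (1 - r_squared) / math.sqrt(n_cases - n_variables)


se_r = standard_error_of_r(0.457809, 174, 3)
margin = 2.58 * se_r  # distance from the population value quoted in the text
print(round(se_r, 3), round(margin, 2))
```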

Standard Error of a Multiple-regression Coefficient.—For the beta
coefficient the standard error is estimated by the formula

$$\sigma^2_{\beta_{12.34 \ldots m}} = \frac{1 - R^2_{1.234 \ldots m}}{(1 - R^2_{2.34 \ldots m})(N - m)}$$   (Standard error of a beta coefficient)   (16.10)

The new symbol here is $R_{2.34 \ldots m}$, which is a multiple R with $X_2$ as the
dependent variable and all other variables except $X_1$ as independent
variables. There would be one of these standard errors for each of the
independent variables in turn, each being substituted for $X_2$. For a
three-variable problem, the R in the denominator reduces to $r_{23}$. Note
that this formula gives the variance error, i.e., $\sigma^2$.

For the b coefficient, the standard error is estimated by

$$\sigma_{b_{12.34 \ldots m}} = \frac{\sigma_{1.234 \ldots m}}{\sigma_{2.34 \ldots m}\sqrt{N - m}}$$   (Standard error of a b coefficient)   (16.11)

Needed in the denominator for each independent variable in turn is the
standard error of estimate of that variable from all other independent
variables. Beyond a three-variable problem this becomes quite laborious,
but in the latter the denominator term reduces to $\sigma_{2.3}$. Unlike the
preceding formula, this gives the standard error without extracting a
square root after it is solved.
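
For the three-variable scholarship problem both standard errors can be evaluated directly. This is a sketch under stated assumptions: the function names are hypothetical, the predictor's standard error of estimate is taken as $\sigma_3\sqrt{1 - r_{34}^2}$ in line with formula (16.6), and the numerical results are not given in the text.

```python
import math


def se_beta(r_sq_full, r_sq_among_predictors, n_cases, n_variables):
    """Standard error of a beta: square root of formula (16.10)."""
    variance = (1 - r_sq_full) / (
        (1 - r_sq_among_predictors) * (n_cases - n_variables))
    return math.sqrt(variance)


def se_b(se_est_dependent, se_est_predictor, n_cases, n_variables):
    """Standard error of a b coefficient, formula (16.11)."""
    return se_est_dependent / (
        se_est_predictor * math.sqrt(n_cases - n_variables))


R_SQ, R34, N, M = 0.457809, 0.396, 174, 3
se_beta_13_4 = se_beta(R_SQ, R34**2, N, M)
# Standard errors of estimate: X1 from X3 and X4; X3 from X4 alone.
sigma_1_34 = 9.1 * math.sqrt(1 - R_SQ)
sigma_3_4 = 17.0 * math.sqrt(1 - R34**2)
se_b_13_4 = se_b(sigma_1_34, sigma_3_4, N, M)
print(round(se_beta_13_4, 4), round(se_b_13_4, 4))
```

The two results differ only by the factor $\sigma_1/\sigma_3$, the same factor that converts a beta into a b in formulas (16.2ab).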

The chief use of these standard errors is to test the null hypothesis, to
determine whether each independent variable has anything at all to
contribute to prediction when its relation to other variables is taken into
account. If the obtained beta or b is not significantly different from zero,
that variable might well be dropped from the regression equation, and a
new equation derived. Some variables may add no more to a multiple
R than would be well within the margin of error as indicated by $\sigma_R$.
Rather than to go to the trouble of computing the standard errors
$\sigma_\beta$ and $\sigma_b$, however, a decision could probably be more quickly
reached, and perhaps just as dependably, by noting the proportion of
variance a variable adds to $R^2$ and comparing this increment to R with
$\sigma_R$. For example, in this simple problem already considered, we find
that the analogies test taken alone could account for about 34 per cent of
the variance in freshman grades; the correlation $r_{13}$ was .583. The
high-school average when taken together with the analogies test could
account for about 12 per cent additional variance ($R^2$ rising from about
.34 to about .46), bringing the correlation to .677. With a standard error
for $r_{13}$ equal to .051, it is unlikely that the correlation of .583 could
have arisen by random sampling from a population in which the correlation
$\bar{r}_{13}$ is .677. With a standard error of .041 for the multiple
correlation, it is unlikely that R could have arisen by random sampling
from a population in which the $\bar{R}$ is only .583. We therefore feel
much confidence that $X_4$ has something unique to offer to predictions and
something that is not merely favorable errors in the sample alone.


SOME PRINCIPLES OF MULTIPLE CORRELATION 


While multiple-correlation problems may be extended to any number of
variables, before we consider the solution with more than three, it is
desirable to examine some of the general principles which apply for any
number of variables but which can be seen more clearly when there are
only three. The two main principles are (1) a multiple correlation
increases as the size of correlations between dependent and independent
variables increases and (2) a multiple correlation increases as the size of
intercorrelations of independent variables decreases. A maximum R will
be obtained when the correlations with $X_1$ are large and when
intercorrelations of $X_2$, $X_3$, . . . , $X_m$ are small. In building a
battery of tests to predict a criterion, test makers have usually tried to
maximize the validity of each test and to minimize the correlations between
tests. There are limitations to the application of these objectives,
however, and in practice they tend to conflict, as we shall see. There are
also apparent exceptions to the rules, as examples will show. The whole
story is not told by the two principles as stated.


Some Typical Combinations of Ti, 
some examples of various combinations 


d of outcome in each instance, 
1). Repeated here for ready 


$$R^2_{1.23} = \frac{r_{12}^2 + r_{13}^2 - 2 r_{12} r_{13} r_{23}}{1 - r_{23}^2}$$



If the correlation $r_{23}$ is zero, the third term in the numerator is zero,
which has a tendency to make $R_{1.23}$ larger. On the other hand, there is a
distinct advantage in having $r_{23}$ very large, because of its role in the
denominator. If $r_{23}$ approaches 1.0, the denominator approaches zero. Even
though the numerator may become small, under these conditions R could
be quite large. A large R is thus favored by having $r_{23}$ either very small
or very large. This principle should be added to the two mentioned above.
But it should be said also that a large $r_{23}$ is more effective when the
independent variables are unequally correlated with the dependent
variable, and particularly when one of the correlations is very small.

TABLE 16.3.—EXAMPLES OF MULTIPLE CORRELATIONS IN A THREE-VARIABLE PROBLEM
WHEN INTERCORRELATIONS VARY

Example   $r_{12}$   $r_{13}$   $r_{23}$   $R^2_{1.23}$   $R_{1.23}$
   1        .4         .4         .0         .3200          .57
   2        .4         .4         .4         .2286          .48
   3        .4         .4         .9         .1684          .41

   4        .4         .2         .0         .2000          .45
   5        .4         .2         .4         .1619          .40
   6        .4         .2         .9         .2947          .54

   7        .4         .0         .0         .1600          .40
   8        .4         .0         .4         .1905          .44
   9        .4         .0         .9         .8421          .92

  10        .4         .2        −.4         .3143          .56
  11        .4        −.4        −.4         .2286          .48
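
The entries of Table 16.3 can be recomputed from formula (16.1). The following sketch uses an illustrative function name, `multiple_r`, and simply loops over the eleven combinations of correlations listed in the table.

```python
import math


def multiple_r(r12, r13, r23):
    """Return (R squared, R) for a three-variable problem, formula (16.1)."""
    r_sq = (r12**2 + r13**2 - 2 * r12 * r13 * r23) / (1 - r23**2)
    return r_sq, math.sqrt(r_sq)


# (r12, r13, r23) for examples 1 to 11 of Table 16.3.
examples = [
    (0.4, 0.4, 0.0), (0.4, 0.4, 0.4), (0.4, 0.4, 0.9),
    (0.4, 0.2, 0.0), (0.4, 0.2, 0.4), (0.4, 0.2, 0.9),
    (0.4, 0.0, 0.0), (0.4, 0.0, 0.4), (0.4, 0.0, 0.9),
    (0.4, 0.2, -0.4), (0.4, -0.4, -0.4),
]
results = [multiple_r(*correlations) for correlations in examples]
for number, (r_sq, r) in enumerate(results, start=1):
    print(number, round(r_sq, 4), round(r, 2))
```

Changing the third value of any triple shows how R falls and then rises again as the intercorrelation grows.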

Note the first example in Table 16.3, in which $r_{23} = .0$. For this event,
formula (16.1) reduces to

$$R^2_{1.23} = r_{12}^2 + r_{13}^2$$   (Multiple R when intercorrelations of two independent variables are zero)   (16.12)

In other words, when independent variables are not correlated, the
proportion of variance predicted by their combination is equal to the sum of
the proportions of variance predicted by each separately. This holds for
any number of independent variables whose intercorrelations are zero.
A psychological interpretation of this is that when intercorrelations among
predictive measures are zero, the total contribution of each to the
prediction of a complex criterion containing all the things predicted is
unique.

Note next the second and third examples and compare them with the
first. In all three, the $r_{12}$ and $r_{13}$ correlations remain constant at .4,
while $r_{23}$ increases first to .4 then to .9. As this happens, R goes from
.57 to .48 to .41. In the last instance $r_{23}$ is so high that there is
practically no gain from combining the two variables $X_2$ and $X_3$. We shall
see a modified result in the next three examples.

In examples 4 to 6, $r_{12}$ remains constant at .4 and $r_{13}$ constant at .2,
while $r_{23}$ varies from .0 to .9. In the first of these three we find formula
(16.12) verified. The two variances sum to .2000 and R is .45. As $r_{23}$
increases to .4, R shrinks back to approximately .40. Thus we can
conclude that if one test has a validity of .4, it may pay to add to it
another with a validity of only .2 provided the two tests intercorrelate
zero. But if there is any appreciable correlation between them, or only a
moderate correlation, it would not pay. What happens if we increase $r_{23}$
still more? When it is as high as .9, R jumps to .54. This supports the
third principle stated above: that $r_{23}$ should be either very low or very
high. One may ask why this principle did not appear to work in the first
three examples. The answer is that it was obscured by the relation of
$r_{12}$ and $r_{13}$. In those examples $r_{12}$ equaled $r_{13}$, and in the next
three examples these correlations were unequal. A better explanation is
that one of them is very small. One may well ask what psychological
meaning is involved in the increase in R when $r_{23}$ is very large. This is
best explained in connection with the next three examples.


In examples 7 to 9, $r_{12}$ and $r_{13}$ are still more uneven in size. They also
have special interest because $r_{13} = .0$ in all three, while $r_{23}$ varies
from .0 to .4 to .9, as in the previous groups of three examples. It would
seem, at first thought, that any test that correlates zero with a criterion
would have no value in predicting that criterion. It is true that alone it
has no value whatever for doing so. But it is not true if that test is
combined with other tests with which it correlates. In example 7, the
common-sense expectation is vindicated. The addition of an invalid test
would offer no improvement. It would simply receive a regression weight of
zero which means it would not be included in the regression equation. But
note that when $r_{23}$ is increased to .4, R becomes .44 and when $r_{23}$ is .9,
R becomes .92. Clearly a test with zero validity may add materially to
prediction if it correlates substantially with another test that is valid.

Suppression Variables.—The psychological significance of this is best
explained by factor theory (see Ch. 18). Roughly, the answer is that
variable $X_2$, in spite of its positive correlation with $X_1$, has some
variance in it that correlates either zero, or perhaps even negatively,
with the criterion. This same variance prevents $X_2$ from correlating as
highly as it might with $X_1$. Variable $X_3$ correlates with $X_2$ because
they have in common that variance not shared by $X_1$. In this kind of
situation we find that $X_3$ acquires a negative regression weight, although
it may correlate only zero, and not negatively, with the criterion. We call
such a variable a suppression variable. Its function in a regression
equation is to suppress whatever variance in other independent variables
may not be represented in the criterion but which may be in some variable
that does otherwise correlate with the criterion.
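
The negative weight of a suppression variable can be seen by applying formulas (16.3ab) and (16.5) to the correlations of example 9 in Table 16.3. This is only an illustrative sketch; the function name is hypothetical and the beta values themselves are not quoted in the text.

```python
def beta_weights(r12, r13, r23):
    """Betas for X2 and X3, formulas (16.3a) and (16.3b)."""
    denominator = 1 - r23**2
    return (r12 - r13 * r23) / denominator, (r13 - r12 * r23) / denominator


# Example 9: X2 is valid (.4), X3 has zero validity but correlates .9 with X2.
r12, r13, r23 = 0.4, 0.0, 0.9
beta_2, beta_3 = beta_weights(r12, r13, r23)
r_squared = beta_2 * r12 + beta_3 * r13  # formula (16.5)
print(round(beta_2, 3), round(beta_3, 3), round(r_squared, 4))
```

The weight for $X_3$ comes out negative even though its correlation with the criterion is zero, which is the defining mark of a suppression variable.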

An example of this came to the author’s attention in testing for pilot 
selection. It was a consistent finding that a vocabulary test, which is as 
pure a measure of the verbal-comprehension factor as we have, correlated 
zero or even slightly negative with the criterion of success in pilot training. 
The same kind of test correlated substantially with a reading-comprehen- 
sion test which also correlated positively with the pilot criterion. The 
reading test correlated positively with the criterion because it measured, 
besides verbal comprehension, such factors as mechanical experience and 
visualization which were also component variances in the pilot criterion. 
The combination of a vocabulary test with the reading test, with a
negative weight for the vocabulary test, would have improved predictions
over those possible with the reading test alone.

The examples mentioned thus far have had only positive correlations
involved. In most practice where human variables are measured we have
only zero or positive correlations, if all measurement scales are aligned so
that "good" qualities are given high numerical values. Where genuinely
negative relationships do occur they are likely to be very small.
Examples 10 and 11 in Table 16.3 are given more for their academic than for
their practical interest. Example 10 should be compared with examples
4, 5, and 6. They differ only in the value of $r_{23}$. When $r_{23}$ becomes
negative, we see that the increase that occurred when $r_{23}$ approaches zero
appears to continue as $r_{23}$ becomes increasingly negative. When $r_{23}$ is
−.4, R is even greater than when $r_{23}$ is .9. It is doubtful whether
situations like example 10 occur in nature, though they are theoretically
possible. The trend could not go too far, however, for with $r_{23}$ large
enough in the negative direction we would come to a multiple R greater than
1.0, which would mean an impossible situation, even mathematically.

Example 11 has two negative correlations, $r_{13}$ and $r_{23}$. These simply
mean that variable $X_3$ probably has a reversed scale, for $X_3$ is related to
both $X_1$ and $X_2$ in the same direction. Note that the multiple R is the
same as if both $r_{13}$ and $r_{23}$ were positive and of the same size
numerically (example 2).

Multiple-R Principles in Larger Batteries.—The principles illustrated
above for the three-variable problems also apply in larger combinations of
variables. The first two principles can be well illustrated by taking other
hypothetical examples like those in Table 16.4. There we have a
demonstration of how multiple R's behave as the number of independent
variables increases from 2 to 20 and as intercorrelations increase from
.0 to .6.


TABLE 16.4.—MULTIPLE CORRELATIONS FROM DIFFERENT NUMBERS OF INDEPENDENT
VARIABLES EACH CORRELATING .30 WITH THE DEPENDENT VARIABLE BUT WITH
INTERCORRELATIONS VARYING*

Number of                     Intercorrelations
independent
variables            .00       .10       .30       .60
    1               (.30)     (.30)     (.30)     (.30)
    2                .42       .40       .37       .34
    4                .60       .53       .44       .36
    9                .90       .67       .48       .37
   20                 —        .79       .52       .38

* Adapted from Thorndike, R. L. Research problems and techniques, in the AAF
Aviation psychology research program reports, No. 3. Washington, D.C.:
Government Printing Office, 1947.
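
A table of this kind can be approximated for any battery size by solving for the betas and then summing beta-times-validity products as in formula (16.5). The sketch below rests on an assumption not stated in this section: that the betas for more than two predictors are the solution of the usual normal equations (intercorrelation matrix times betas equals the validities). All function names are illustrative, and values may differ from the printed table by a unit in the second decimal.

```python
import math


def solve(matrix, rhs):
    """Solve matrix * x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - tail) / aug[i][i]
    return x


def battery_r(n_tests, validity, intercorrelation):
    """Multiple R for n_tests predictors with equal validities and equal
    intercorrelations (ASSUMES betas come from the normal equations)."""
    rxx = [[1.0 if i == j else intercorrelation for j in range(n_tests)]
           for i in range(n_tests)]
    rxy = [validity] * n_tests
    betas = solve(rxx, rxy)
    r_squared = sum(beta * r for beta, r in zip(betas, rxy))
    return math.sqrt(r_squared)


for n_tests in (1, 2, 4, 9, 20):
    row = [battery_r(n_tests, 0.30, rho) for rho in (0.0, 0.1, 0.3, 0.6)]
    print(n_tests, [round(value, 2) for value in row])
```

With 20 uncorrelated tests the computed value exceeds 1.0, which corresponds to the dash in the first column of the table.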


Following Thorndike's choices, we will assume that each variable cor-
relates with a criterion to the extent of .3. This is a rather low validity
coefficient, and about the lower limit of usefulness for a single test or other
predictive device. We will see, however, how valuable such instruments
may be when combined in a battery, provided their intercorrelations are
not too high.
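The entries of Table 16.4 can be checked with a short computation. The sketch below is not taken from the text: it assumes the standard result that, when k predictors all have the same validity r and the same intercorrelation, R² = k r² / [1 + (k − 1) × intercorrelation]. The function name `multiple_r_equal` is hypothetical.

```python
import math

def multiple_r_equal(k, validity, intercorrelation):
    """Multiple R for k predictors sharing one validity and one intercorrelation.

    Returns None when the combination would require R greater than 1.0,
    which no real set of correlations can produce.
    """
    r_squared = k * validity ** 2 / (1 + (k - 1) * intercorrelation)
    if r_squared > 1:
        return None
    return math.sqrt(r_squared)

# Rebuild the body of the table: rows are numbers of tests, columns are
# the common intercorrelations .00, .10, .30, and .60.
for k in (1, 2, 4, 9, 20):
    row = [multiple_r_equal(k, 0.30, rho) for rho in (0.0, 0.1, 0.3, 0.6)]
    print(k, ["--" if r is None else f"{r:.2f}" for r in row])
```

With validities of .30 and zero intercorrelations, the function returns None once k exceeds 11, which is the limit mentioned in the discussion of the table.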


In the second row of Table 16.4, when two such tests are combined, we
see how the multiple R decreases from .42 when $r_{23}$ is zero to .34 when
$r_{23}$ is [...] battery of 20, except for the case of zero inter-
correlations, for which the limit of R = 1.0 was passed when the number of
tests exceeded 11. In this situation [...] of formula (16.12) still applies.
The [...] contributed by each test would be .09, a [...] preceding, it would
apparently pay to [...] 20 were included. Matters of administrative effort
would have to be balanced against gains in R.

Table 16.4 tells an even more important story. The value of having [...]


factor, however, he will often find that each test tends to correlate low with 
the criterion. This is because a practical criterion, of training achieve- 
ment or of job performance, is usually a complex variable; it has a number 
of component variances, each component being a common factor (see 
Ch. 18). If one tries to increase the correlation of a test with a criterion, 
the result is almost invariably to increase the factorial complexity of the 
test; to bring in more different factor variances. This automatically raises 
the correlation of this test with other tests, because they have more factors 
in common. This is the reason that in practice the two principles men- 
tioned first lead to conflicting objectives. Where there has to be a choice, 
it seems wisest to give less attention to the first principle (of maximizing 
correlation of each test with the criterion) and greater attention to the 
second (of minimizing intercorrelations). If there are 20 independent fac- 
tors represented in a practical criterion, and if each is of equal importance, 
each would contribute .05 of the total variance. Each test, measuring
only one of the factors, would need to correlate only $\sqrt{.05}$, which is .224,
with the criterion. In this case, raising the correlation between any one
test and the criterion would be of little use. There would be no objec-
tion to a higher correlation. Appropriate weighting would bring the
test's contribution to prediction down to required proportions. Thus, it
can be concluded that low correlations of tests with practical criteria can
be tolerated provided we can combine enough tests in a battery and
provided their intercorrelations are near zero.¹

¹ For a detailed discussion of these problems, see Guilford, J. P. New standards
for test evaluation. Educ. & Psychol. Meas., 1946, 6, 427-438.


MULTIPLE CORRELATION WITH MORE THAN THREE VARIABLES


With more than three variables, the best solution of a regression equa-
tion and of a multiple R is by means of the Doolittle method. This pro-
cedure will be outlined step by step for a five-variable problem. We shall
use all the variables represented in Table 16.1, asking what regression
weights would best predict $X_1$ from the other four combined and what the
correlation of those predictions with obtained $X_1$ values would be.

Solution of Normal Equations.—The mathematically inclined reader
will appreciate better what is transpiring in applying the Doolittle method
if he knows that he is actually solving simultaneous equations. The
unknowns are the beta coefficients, and there are as many equations as
unknowns. For a five-variable problem, in which there are four unknown
betas, the equations are

$$
\begin{aligned}
\beta_{12} + r_{23}\beta_{13} + r_{24}\beta_{14} + r_{25}\beta_{15} &= r_{12} \\
r_{23}\beta_{12} + \beta_{13} + r_{34}\beta_{14} + r_{35}\beta_{15} &= r_{13} \\
r_{24}\beta_{12} + r_{34}\beta_{13} + \beta_{14} + r_{45}\beta_{15} &= r_{14} \\
r_{25}\beta_{12} + r_{35}\beta_{13} + r_{45}\beta_{14} + \beta_{15} &= r_{15}
\end{aligned}
\tag{16.13}
$$

(Normal equations for the solution of beta weights)

The beta coefficients are symbolized in abbreviated form here to conserve
space. $\beta_{12}$, in full, would be $\beta_{12.345}$ and $\beta_{13}$ would be $\beta_{13.245}$, and so on.
The equations are systematic, the r coefficients being arranged as in the
original table of intercorrelations (see Table 16.1). The betas in the
diagonal positions might be expected to have coefficients $r_{22}$, $r_{33}$, $r_{44}$, and
$r_{55}$ attached to them, but instead the coefficients attached to these betas
are all +1.0, as the least-square solution requires.
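For readers who want to see the normal equations solved directly, the following is a minimal sketch using ordinary elimination and back substitution on unrounded values. It is not the Doolittle work sheet itself. The correlations are those used for this problem in the chapter's tables; the function name `solve_normal_equations` is hypothetical.

```python
# Intercorrelations of the predictors X2, X3, X4, X5 (unities in the diagonal)
# and their correlations with the dependent variable X1.
R = [
    [1.000, 0.562, 0.401, 0.197],
    [0.562, 1.000, 0.396, 0.215],
    [0.401, 0.396, 1.000, 0.345],
    [0.197, 0.215, 0.345, 1.000],
]
r1 = [0.465, 0.583, 0.546, 0.365]

def solve_normal_equations(matrix, right_side):
    """Solve matrix * betas = right_side by elimination without pivoting.

    Skipping pivoting is an assumption that is safe for a well-behaved
    correlation matrix such as this one.
    """
    n = len(right_side)
    aug = [list(matrix[i]) + [right_side[i]] for i in range(n)]
    # Forward elimination: clear the entries below each diagonal element.
    for i in range(n):
        for j in range(i + 1, n):
            factor = aug[j][i] / aug[i][i]
            aug[j] = [a - factor * b for a, b in zip(aug[j], aug[i])]
    # Back substitution, last unknown first.
    betas = [0.0] * n
    for i in reversed(range(n)):
        known = sum(aug[i][c] * betas[c] for c in range(i + 1, n))
        betas[i] = (aug[i][n] - known) / aug[i][i]
    return betas

betas = solve_normal_equations(R, r1)
print([round(b, 4) for b in betas])
```

Because no intermediate rounding is done here, the betas may differ from hand-computed values in the fourth decimal place.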

The Doolittle-solution Operations.—First we prepare a work sheet like
that in Table 16.5. There is a column for every variable and the number-
ing corresponds. A last column is introduced for the purpose of checking
the calculations, as will be explained. The rows are designated by letters,
and in the first column, a shorthand instruction is noted. These will be
explained.


TABLE 16.5.—SOLUTION OF A MULTIPLE-CORRELATION PROBLEM BY THE DOOLITTLE
METHOD

  Column number           (2)       (3)       (4)       (5)       (1)      Check
  Variable                X2        X3        X4        X5        X1       Sum
  Row  Instruction

  A    r2k               1.0000     .5620     .4010     .1970     .4650    2.6250
  B    A ÷ (−A2)        −1.0000    −.5620    −.4010    −.1970    −.4650   −2.6250

  C    r3k                         1.0000     .3960     .2150     .5830    2.7560
  D    A × B3                      −.3158    −.2254    −.1107    −.2613   −1.4752
  E    C + D                        .6842     .1706     .1043     .3217    1.2808
  F    E ÷ (−E3)                  −1.0000    −.2493    −.1524    −.4702   −1.8720

  G    r4k                                   1.0000     .3450     .5460    2.6880
  H    A × B4                                −.1608    −.0790    −.1865   −1.0526
  I    E × F4                                −.0425    −.0260    −.0802    −.3193
  J    G + H + I                              .7967     .2400     .2793    1.3161
  K    J ÷ (−J4)                            −1.0000    −.3012    −.3506   −1.6519

  L    r5k                                             1.0000     .3650    2.1220
  M    A × B5                                          −.0388    −.0916    −.5171
  N    E × F5                                          −.0159    −.0490    −.1952
  O    J × K5                                          −.0723    −.0841    −.3964
  P    L + M + N + O                                    .8730     .1403    1.0133
  Q    P ÷ (−P5)                                      −1.0000    −.1607   −1.1607


Step 1. Record in row A the correlations with $X_2$. These are obtained
here from Table 16.1. In column (2), a coefficient of 1.0000 is
inserted, because it is demanded by the Doolittle method. We
are going to carry four decimal places throughout the solution
(one more than those given in the r's); so we record all numbers
to four places.

Step 2. Sum the values recorded in row A, and give the sum in the last
or "check" column. This will be used later.

Step 3. Divide the numbers in row A each by −1.0000. In the table,
the instruction reads "A ÷ (−A2)," which means that each
number in row A is to be divided by the number that appears at
A2 [row A, column (2)] with sign changed. This includes the
last column as well.

Step 4. Record in row C all the remaining correlations with $X_3$. We say
"remaining," because one is already recorded, namely, $r_{23}$. The
value of 1.0000 is recorded at C3.

Step 5. Sum all the correlations with $X_3$, including the .5620 in row A.
Record the sum in the "check" column.

Step 6. The numbers in row D are found by the instruction "A × B3,"
which means to multiply all the numbers in row A [beginning in
column (3)] by the number that appears in row B and column (3).
This number is −.5620 in Table 16.5.

Step 7. Row E calls for the addition of all numbers in rows C and D.

Step 8. Row F calls for the division of all numbers in row E by the num-
ber appearing in row E and column (3), with sign changed. This
number, with sign changed, is −.6842.

Step 9. We are ready for the first checking of calculations. Sum the
values in row F, not including the last column. This should equal
approximately −1.8720 in this particular problem, which was
found by the steps already described. If there is a serious dis-
crepancy here (other than in the fourth decimal place), check
row E by adding values up to the check column. If this does
not check, there is an error further back, and some recalculating
is in order. All checks should be satisfied before proceeding.

Step 10. In row G, record remaining correlations with $X_4$, with 1.0000 at
G4.

Step 11. Sum all the correlations with $X_4$, and record in the last column
in row G.

Step 12. Values in row H are the products of values in row A times the
number at B4. This number is −.4010.

Step 13. Values in row I are the products of numbers in row E times the
number at F4, which is −.2493.

Step 14. Sum the numbers in rows G, H, and I for each column. The sums
are recorded in row J.

Step 15. Divide row J through by the number at J4, with sign changed;
in other words, by −.7967.

Step 16. Check by summing row K up to the last column. Does the
sum agree with the number already found in that column?

Step 17 and after. By now the abbreviated instructions for each row
should be clear by analogy to those already given. The final
check is made in row Q.


The illustrative solution is set up for a five-variable problem, but a larger
number of variables would be treated in a similar manner simply by
extending the table to more rows and columns. A smaller number of
variables would mean fewer rows and columns. It will be noticed that
the table is set up in terms of blocks of work, each one beginning with the
entrance of correlations for a new variable and ending by dividing by a
number that will assure a −1.0000 as the first number in the last row of that
block. The work will be found to be very systematic throughout. Any
variable may be treated as the dependent variable, but it must then occupy
the next to the last column in the table.
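The block-by-block routine of the work sheet can be mechanized roughly as follows. This is a sketch under stated assumptions rather than a definitive program: each row of correlations is entered in full (including the entries to the left of the diagonal, which simply reduce to zero), every product and sum is rounded to four places as in the hand solution, and the names `doolittle_forward`, `reduced`, and `divided` are hypothetical.

```python
# Columns are ordered X2, X3, X4, X5, X1, as in the work sheet.
CORRELATIONS = [
    [1.000, 0.562, 0.401, 0.197, 0.465],  # correlations with X2 (row A)
    [0.562, 1.000, 0.396, 0.215, 0.583],  # correlations with X3
    [0.401, 0.396, 1.000, 0.345, 0.546],  # correlations with X4
    [0.197, 0.215, 0.345, 1.000, 0.365],  # correlations with X5
]

def doolittle_forward(rows, places=4):
    """Return the reduced rows (A, E, J, P) and divided rows (B, F, K, Q)."""
    reduced, divided = [], []
    for i, correlations in enumerate(rows):
        # Enter the correlations for a new variable, plus their check sum.
        row = list(correlations) + [sum(correlations)]
        # Add the products from every earlier block (rows D; H, I; M, N, O).
        for j in range(i):
            multiplier = divided[j][i]
            row = [round(value + round(earlier * multiplier, places), places)
                   for value, earlier in zip(row, reduced[j])]
        # Close the block: divide by the diagonal entry with sign changed.
        pivot = row[i]
        new_divided = [round(value / -pivot, places) for value in row]
        # The row check: the divided values should sum to the check column.
        if abs(sum(new_divided[:-1]) - new_divided[-1]) > 0.0005:
            raise ValueError(f"check column disagrees in block {i + 1}")
        reduced.append(row)
        divided.append(new_divided)
    return reduced, divided

reduced, divided = doolittle_forward(CORRELATIONS)
print(divided[-1])  # the last divided row, corresponding to row Q
```

The final divided row carries the negative of the last beta in the column for $X_1$; the remaining betas come from the back solution.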

Solution of the Beta Coefficients.—The work represented in Table 16.5 is
only a part of the Doolittle solution. The end result gives the beta coef-
ficients, which we find by means of a "back solution," so called because
we work in a backward direction, as compared with the work in Table 16.5.
This work can be tabulated, but it is probably clearest to the beginner in
the form of equations. The first beta found is $\beta_{15}$, which can be located
without further ado in Table 16.5. It is the number at the intersection
of row Q and column (1), but with sign changed (in other words, it is
described as −Q1). $\beta_{15}$ is therefore +.1607. The other betas require
more work; so we will follow the procedure step by step, including again
the first step already taken, for the sake of completeness.

Step 1. $\beta_{15} = -Q1 = +.1607$

Step 2. $\beta_{14} = -K1 + \beta_{15}(K5) = .3506 + (.1607)(-.3012) = +.3022$

Step 3. $\beta_{13} = -F1 + \beta_{15}(F5) + \beta_{14}(F4)$
        $= .4702 + (.1607)(-.1524) + (.3022)(-.2493) = +.3703$

Step 4. $\beta_{12} = -B1 + \beta_{15}(B5) + \beta_{14}(B4) + \beta_{13}(B3)$
        $= .4650 + (.1607)(-.1970) + (.3022)(-.4010) + (.3703)(-.5620)$
        $= +.1039$
Before going further, it is well to check the calculations of the beta
coefficients. This can be done by using the last equation in (16.13):

$$\beta_{12}r_{25} + \beta_{13}r_{35} + \beta_{14}r_{45} + \beta_{15} = r_{15}$$

Substituting known values,

$$(.1039)(.197) + (.3703)(.215) + (.3022)(.345) + .1607 = .3651$$

Since $r_{15}$ = .365, the check is satisfied, and we may assume that there has
been no error in computing the betas. This checking procedure can be
summarized as in Table 16.6, which provides a convenient work plan.


TABLE 16.6.—A CHECK UPON THE COMPUTATION OF THE BETA COEFFICIENTS

  Variable     β1k       rk5      β1k rk5
  X2          .1039      .197      .0205
  X3          .3703      .215      .0796
  X4          .3022      .345      .1043
  X5          .1607     1.000      .1607

                               Σ = .3651 = r15
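The back solution and its check can be followed in a few lines of code. The sketch below takes the divided rows B, F, K, and Q from Table 16.5 as given and applies Steps 1 to 4 in order. The dictionary layout and the name `back_solution` are assumptions made for illustration.

```python
# Divided rows of Table 16.5, keyed by column number (the variable subscript).
B = {3: -0.5620, 4: -0.4010, 5: -0.1970, 1: -0.4650}
F = {4: -0.2493, 5: -0.1524, 1: -0.4702}
K = {5: -0.3012, 1: -0.3506}
Q = {1: -0.1607}

def back_solution(blocks):
    """blocks: (variable number, divided row) pairs, last block first."""
    betas = {}
    for variable, row in blocks:
        # Start with the entry in column (1), sign changed ...
        value = -row[1]
        # ... then add each beta already found times the entry in its column.
        for other, beta in betas.items():
            value += beta * row[other]
        betas[variable] = value
    return betas

betas = back_solution([(5, Q), (4, K), (3, F), (2, B)])

# Check with the last normal equation, as in Table 16.6.
r_k5 = {2: 0.197, 3: 0.215, 4: 0.345, 5: 1.000}
check = sum(betas[k] * r_k5[k] for k in betas)
print({k: round(v, 4) for k, v in betas.items()}, round(check, 4))
```

Since the code carries unrounded betas from one step to the next, its results may differ from the hand-rounded steps by a unit or so in the fourth decimal place.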
The Solution of Regression Weights and the Multiple R.—Each b
coefficient needed in the multiple-regression equation is found from its
corresponding beta. Equations like those in formulas (16.2a) and (16.2b)
apply. The b weight for $X_2$ should now read in full $b_{12.345}$ to indicate that
we are interested in the relation of $X_1$ to $X_2$, other variables, $X_3$, $X_4$, and
$X_5$, being held constant. For the sake of brevity (as, indeed, we have
already done for the betas), we shall denote the b's only by the first two
subscript numbers $b_{12}$, $b_{13}$, etc. In the solution of a multiple R, equation
(16.5) needs to be extended to include as many terms as there are variables.
$R^2$ is the sum of the products of beta times its corresponding r, i.e.,

$$R^2 = \beta_{12}r_{12} + \beta_{13}r_{13} + \beta_{14}r_{14} + \beta_{15}r_{15} \tag{16.14}$$

(General solution of R from beta coefficients)

The a coefficient in the equation is also found by formula (16.4), extended
with as many terms as necessary. It is the mean of the $X_1$ values minus
the products of other means times their corresponding b weights, as

$$a = M_1 - b_{12}M_2 - b_{13}M_3 - b_{14}M_4 - b_{15}M_5 \tag{16.15}$$

(Constant a in a multiple regression equation)


All these operations are conveniently carried out in a work sheet like
Table 16.7, where R and the regression weights are systematically cal-
culated. The second column contains the four betas. The third contains
the original or raw correlations of the four variables with $X_1$. The sub-
script k stands for variables 2 to 5 in turn. The fourth column contains
the cross products of betas times corresponding r's. Their sum is $R^2$,
which here is .487855; and by taking the square root we find R is .698.
This R, with full subscript, would read $R_{1.2345}$.


TABLE 16.7.—SOLUTION OF THE REGRESSION COEFFICIENTS FOR THE MULTIPLE-
REGRESSION EQUATION

    (1)       (2)      (3)       (4)        (5)      (6)     (7)       (8)
  Variable    β1k      r1k     β1k r1k     σ1/σk     b1k     Mk     (−Mk)b1k

    X2       .1039     .465    .048314     1.750     .182    19.7    − 3.585
    X3       .3703     .583    .215885      .535     .198    49.5    − 9.801
    X4       .3022     .546    .165001      .469     .142    61.1    − 8.676
    X5       .1607     .365    .058655     2.459     .395    29.7    −11.732

                          Σ    .487855 = R²                      Σ   −33.794
                                  .698 = R                       M1   73.800
                                                                 a =  40.006

So much for the multiple R, which we see is not increased very much by
including two more variables ($X_2$ and $X_5$) over that obtained when we used
only $X_3$ and $X_4$. Then R equaled .677. The coefficient of determination
is now .4879, or we have accounted for 48.8 per cent of the variance of
freshman scholarship, as compared with 45.8 per cent without using $X_2$
and $X_5$. The standard error of estimate (now designated as $\sigma_{1.2345}$ in full)
equals 6.5, where before it was 6.7, a trifling change. The index of fore-
casting efficiency is now 28.4 per cent, where before it was 26.4 per cent.
It is therefore questionable whether the trouble of measuring and using in
the regression equation the two additional variables is worth while. This
is a good example of the way in which each additional variable yields
diminishing returns in the way of improved predictions.

For the solution of the b coefficients, we introduce in Table 16.7 first the
column headed $\sigma_1/\sigma_k$. This is the ratio by which each beta is to be multi-
plied. The b coefficients follow in column (6). They tell how many units
$X_1$ is increasing for each unit of increase in the other variables. From
these taken alone, it would seem that $X_5$ (interests) has the greatest bear-
ing upon freshman marks and that $X_4$ (high-school average) has the least.
But such is not the situation. The best comparison of each variable's con-
tribution to the variance in $X_1$ is to be seen in column (4), where each beta
is multiplied by the corresponding raw r. Here it is seen that $X_3$ con-
tributes about 21 per cent, $X_4$ nearly 17 per cent, whereas $X_5$ contributes
only about 6 per cent, and $X_2$ about 5 per cent. These statements are
relative to this correlational situation, with the influences of overlapping
among the four taken into account. But as to choices among the four
variables that we have here, they come in the same rank order as the βr
products.
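The figures quoted in the last two paragraphs follow from the betas and the raw correlations, as the sketch below illustrates. One assumption comes from outside this section: the index of forecasting efficiency is taken in its usual form, 100(1 − √(1 − R²)). The variable names are hypothetical.

```python
import math

# Betas and raw correlations with X1 for variables 2 to 5.
betas = {2: 0.1039, 3: 0.3703, 4: 0.3022, 5: 0.1607}
validities = {2: 0.465, 3: 0.583, 4: 0.546, 5: 0.365}

# Formula (16.14): R squared is the sum of the beta-times-r products.
contributions = {k: betas[k] * validities[k] for k in betas}
r_squared = sum(contributions.values())
multiple_r = math.sqrt(r_squared)

# Assumed definition of the index of forecasting efficiency, in per cent.
efficiency = 100 * (1 - math.sqrt(1 - r_squared))

# Rank the variables by the size of their beta-times-r products.
rank_order = sorted(contributions, key=contributions.get, reverse=True)

print(round(r_squared, 6), round(multiple_r, 3), round(efficiency, 1), rank_order)
```

Applying the same efficiency formula to the earlier R of .677 gives about 26.4 per cent, which agrees with the comparison made in the text.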

For the solution of the a coefficient, the last two columns are included.
This coefficient turns out to be exactly 40.0. The entire regression equa-
tion now reads

$$X'_1 = 40.0 + .182X_2 + .198X_3 + .142X_4 + .395X_5$$

With this equation, we could predict an $X'_1$ for every student, knowing his
four scores in the other variables. As was said before, the addition of the
terms involving $X_2$ and $X_5$ yields scarcely enough additional accuracy of
prediction to justify their inclusion. One could try combinations of three
predictive indices, variables $X_2$, $X_3$, and $X_4$, or $X_3$, $X_4$, and $X_5$, to see
what happens. From the results in Table 16.7, it would seem that the last-
mentioned combination of three is the more promising. One could deter-
mine by another Doolittle solution whether it increased R sufficiently
above .677 to justify the inclusion of $X_5$ with $X_3$ and $X_4$.
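Columns (5) to (8) of Table 16.7 amount to the short computation sketched here: each beta is multiplied by its ratio of standard deviations to give b, and formula (16.15) then gives a. The function name `predict` is hypothetical, and the example scores passed to it are simply the four means.

```python
betas = [0.1039, 0.3703, 0.3022, 0.1607]      # X2, X3, X4, X5
sd_ratios = [1.750, 0.535, 0.469, 2.459]      # standard deviation of X1 over that of Xk
means = [19.7, 49.5, 61.1, 29.7]              # means of X2 to X5
mean_criterion = 73.8                         # mean of X1

# b weights, rounded to three places as in column (6).
b_weights = [round(beta * ratio, 3) for beta, ratio in zip(betas, sd_ratios)]

# Formula (16.15): a is the criterion mean minus the weighted predictor means.
a = mean_criterion - sum(b * m for b, m in zip(b_weights, means))

def predict(scores):
    """Predicted X1 for one person's scores on X2, X3, X4, X5."""
    return a + sum(b * x for b, x in zip(b_weights, scores))

print(b_weights, round(a, 3), round(predict(means), 1))
```

A person who stands at the mean on every predictor is predicted to stand at the mean of the criterion, which is a quick check on the constant a.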


SHORT SOLUTIONS FOR REGRESSION WEIGHTS 


Solution of a multiple-regression problem, even with the convenient 
Doolittle procedure, becomes energy and time consuming when the num- 
ber of variables is large. The author has known of test batteries involving 
as many as 20 possible scores that could be combined each with its appro- 
priate weight. When there are more than six variables the situation calls 
for possible short cuts or approximation methods. Two methods will be 
mentioned to meet this need, one of which will be illustrated. 

The Wherry-Doolittle Method.—In recent years a modified Doolittle
solution has been introduced by Wherry.¹ The method was designed to
meet the requirement of assembling a battery of tests to select personnel
for some particular assignment. It takes particular cognizance of the
fact that when a large number of tests are validated singly for the predic-
tion of a certain criterion, only four or five when combined often seem
sufficient. As a matter of fact, adding tests beyond the point at which
all the factors that the tests measure in common with the criterion are
covered often merely contributes error variance to the composite. Even
before the point has been reached where there is no apparent improvement
in prediction, errors have entered into the picture to help determine
the regression weights. This point was mentioned earlier in connection
with the discussion of shrinkage formulas [see formulas (16.7) and (16.8)].

¹ Described in full in Stead, W. H., Shartle, C. L., et al. Occupational counseling
techniques. New York: American Book, 1940, 245-255.

The principles of the Wherry-Doolittle method are, briefly, as follows. 
One starts with the single test that seems to offer most in prediction of the 
criterion. The method then aids in selection of the second test that will 
have most to add to prediction when combined with the first. A third 
can be selected which will add most by way of prediction when combined 
with the first two, and so on. At each step a shrinkage formula is applied 
in order to determine whether the shrunken R is appreciably larger than 
the previous R. At the point where no further gain according to these 
standards is apparent, no more tests are added. 

The method does undoubtedly offer an efficient way of assembling a 
battery of tests to meet a particular purpose. It results in a list of predic- 
tive instruments that, out of a larger number tried experimentally, is 
minimum for doing the job. The author is inclined toward a quite dif- 
ferent philosophy of development of test batteries, however, which would 
render the Wherry-Doolittle procedure unnecessary when there is suffi-
cient information about the criterion and the tests.¹ For this reason the
space that it would take to explain and demonstrate the Wherry-Doolittle 
method is not used here. The reason why only four or five tests have 
seemed to be the limit in a useful battery is because only a limited number 
of the human abilities and other traits that are involved in a practical 
criterion have been represented in the tests. Although a dozen different 
tests may have been tried out, the same limited number of fundamental 
factors have been measured by them and the measurement is duplicated 
several times over. If a careful study of the criterion is made revealing 
all the factors that are worth trying to predict and if there is sufficient 
variety in the tests to take care of all the factors, it will be found that 
more than four or five tests will probably be needed. If one knows that 
there are 10 traits in the criterion that are worth covering with tests, and 
if it takes 10 tests to do it, then one could put the 10 tests in a battery and 
expect that every one would have something to contribute toward predic- 


tion. A successive selection of tests by a method such as the Wherry- 
Doolittle would then be unnecessary. 


¹ For a discussion of this at some length, see Guilford, J. P. Factor analysis in a test
development program. Psychol. Rev., 1948, 55, 79-94.

An Iterative Solution of Regression Weights.—The iterative procedure
for computing beta weights to be described and illustrated is economical,
particularly for a problem with many variables, and will probably lead to
satisfactory results in most cases.¹ The operations will be described step
by step and illustrated in Table 16.8 with the use of the same data to which
the Doolittle method was applied earlier.


TABLE 16.8.—AN ITERATIVE SOLUTION OF THE BETA COEFFICIENTS

  Correlations

    k      r2k      r3k      r4k      r5k      r1k
    2     1.000     .562     .401     .197     .465
    3      .562    1.000     .396     .215     .583
    4      .401     .396    1.000     .345     .546
    5      .197     .215     .345    1.000     .365
    Σ     2.160    2.173    2.142    1.757

  Discrepancies (r'1k − r1k), trials 1 to 6

    k       1        2        3        4        5        6
    2    +.0633   +.0033     ...      ...    −.0021   +.0091
    3    −.0088   −.0425     ...      ...    −.0223   −.0023
    4      ...      ...      ...      ...    −.0081   −.0002
    5      ...      ...      ...      ...    −.0047   −.0004

  Discrepancies (r'1k − r1k), trials 7 to 12

    k       7        8        9       10       11       12
    2    −.0009     ...    −.0003   +.0005   +.0007   −.0003
    3    −.0079   +.0022   −.0006   +.0005   +.0007   −.0002
    4    −.0042   −.0002   −.0022   −.0002   +.0001   −.0003
    5    −.0024   −.0003   −.0013   −.0006   +.0004   +.0004

  Guessed betas

    Trial   β'12    β'13    β'14    β'15
      1     .2      .3      .3      .2
      2     .14     .3      .3      .2
      3     .14     .34     .3      .2
      4     .14     .34     .3      .16
      5     .12     .34     .3      .16
      6     .12     .36     .3      .16
      7     .11     .36     .3      .16
      8     .11     .37     .3      .16
      9     .105    .37     .3      .16
     10     .105    .37     .302    .16
     11     .105    .37     .302    .161
     12     .104    .37     .302    .161

            .104    .370    .302    .161
            β12     β13     β14     β15

The general principle of the method is (1) to guess what the betas are 
going to be, (2) to substitute them in the normal equations (see equations 
16.13), (3) see how much discrepancy there is between the known validity 
coefficients and those that follow from the guessed betas, and (4) make 
corrections in the guessed betas. These steps are repeated until the dis- 
crepancies practically vanish. The correlations that enter into the normal 


equations are listed first in the worktable, upper left-hand corner. From 
here on the steps will be listed. 


Step 1. Compute the sum of each column of correlations, $\Sigma r_{ak}$, where a
stands for each of the independent variables representing columns
and k stands for each of the variables in rows in turn. $\Sigma r_{ak}$ in the
first column of correlations is $\Sigma r_{2k}$, and so on.

Step 2. Make a guess for the size of each beta (these will be $\beta'_{12}$, $\beta'_{13}$, and
so on) by dividing the validity coefficient for each test by the sum
of its column of r's. These may be made to two [...] start with,
but one place [...] is estimated by the ratio .465/2.160, which
equals .215, but this has been rounded to .2.

Step 3. Solve each equation, substituting the guessed betas for the
unknown betas. The first equation would read [...]

Step 4. Find the discrepancy between [...] values $d_k$. For the first
test, $d_2 = r'_{12} - r_{12} = +.0633$. This [...] .465 which had been
obtained. The $d_3$ of −.0088 for variable $X_3$ indicates that the
guessed betas underestimate the validity of that test.

¹ The procedure is the author's version of R. L. Thorndike's adaptation of one
originally developed by Kelley and Salisbury. See Thorndike, R. L. Research
problems and techniques. AAF aviation [...]

Step 5. Make the first change in the guessed betas. Although we can see
that the betas for $X_2$, $X_4$, and $X_5$ have been perhaps overestimated
and that for $X_3$ underestimated, it is most convenient, and per-
haps just as expedient, to make only one change at a time. Note
where the largest discrepancy is. It is the +.0633 for variable
$X_2$. If we make a change only in $\beta'_{12}$, it will affect only the first
term in each equation and will involve only the first column of
correlations. To lower $d_2$ to zero for the first test in the list, we
would need to multiply 1.000 by some amount that will cancel it.
A change of −.0633 would do this, but it is best to limit adjust-
ments to the second decimal place at this stage. We will therefore
change $\beta'_{12}$ by −.06, making it .14.

Step 6. Modify the discrepancies in line with the change in $\beta'_{12}$ just men-
tioned. Every $d_k$ will be altered by adding to it the product of
the change times the corresponding value $r_{2k}$. The first $d_k$ will
be +.0633 + (−.06)(1.000) = +.0033. The second $d_k$ will be
−.0088 + (−.06)(.562) = −.0425, and so on.


The general pattern of the procedure is now complete. We keep on
making successive adjustments as called for, computing the altered dis-
crepancies, with an attempt to reduce them almost to zero. Since we are
expecting three-place accuracy in the betas, we will find that it pays to
continue until the discrepancies are not over .0005. After we have
achieved good adjustment up to the second decimal place in the betas,
we then proceed to make adjustments in the third decimal place. A com-
parison of the betas found in Table 16.8 with those found by the Doolittle
solution (see Table 16.7) will show very good agreement to the third deci-
mal place.

From the beta coefficients found in this manner one may proceed to com-
pute the multiple correlation, the b weights, and other derived statistics.

Great care should be taken for accuracy of computation. Errors may
creep in at any stage and it still might be possible to reach what looks like
a satisfactory solution, that is, with zero discrepancies, with wrong betas.
It would certainly be well to check the accuracy of the obtained betas as
was done following the Doolittle solution. There may be some problems,
with peculiar combinations of correlations, in which the iteration would not
achieve zero discrepancies even after a long series of trials. The author
has not encountered such a situation as yet. The routine described above
may be modified as the user of it gains experience. There are opportuni-
ties for making wiser choices of betas and changes in betas that might cut
the number of steps.
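The routine lends itself to a short program. The sketch below is one possible mechanization of the hand procedure, not the author's own rule book: it always changes the beta whose equation shows the largest discrepancy, makes the change in the second decimal place while that discrepancy is .005 or more and in the third place afterward, and stops when no discrepancy exceeds .0005. The .005 switching point and all names are assumptions.

```python
# Intercorrelations of X2, X3, X4, X5 and their validities (correlations with X1).
R = [
    [1.000, 0.562, 0.401, 0.197],
    [0.562, 1.000, 0.396, 0.215],
    [0.401, 0.396, 1.000, 0.345],
    [0.197, 0.215, 0.345, 1.000],
]
VALIDITY = [0.465, 0.583, 0.546, 0.365]

def discrepancies(betas):
    """Validity implied by the guessed betas minus the obtained validity."""
    return [sum(r * b for r, b in zip(row, betas)) - v
            for row, v in zip(R, VALIDITY)]

def iterate_betas(tolerance=0.0005, max_trials=200):
    n = len(VALIDITY)
    # Steps 1 and 2: first guesses are validity over column sum, to one place.
    column_sums = [sum(R[k][a] for k in range(n)) for a in range(n)]
    betas = [round(v / s, 1) for v, s in zip(VALIDITY, column_sums)]
    trials = [list(betas)]
    for _ in range(max_trials):
        # Steps 3 and 4: substitute the guesses and find the discrepancies.
        d = discrepancies(betas)
        worst = max(range(n), key=lambda k: abs(d[k]))
        if abs(d[worst]) <= tolerance:
            break
        # Step 5: change one beta only, by the rounded discrepancy.
        places = 2 if abs(d[worst]) >= 0.005 else 3
        betas[worst] = round(betas[worst] - round(d[worst], places), 3)
        trials.append(list(betas))
    return betas, trials

betas, trials = iterate_betas()
print(len(trials), betas)
```

On these data the program starts from .2, .3, .3, .2 and ends with .104, .370, .302, .161. The `max_trials` guard is there because nothing in the sketch proves that the routine settles for every set of correlations.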

Thorndike makes some suggestions concerning the original source of
guessed betas.¹ If we have prior knowledge of how a given test has per-
formed in a similar battery for making a similar prediction, it would be
well to start with that knowledge. If the battery is a very large one (10 or
more) it would be desirable to start with about half of the guessed betas
equal to zero. Kelley and Salisbury had suggested that each beta be
guessed as about half the corresponding validity coefficient, but Thorndike
suggests between one-fourth and one-half is better. If a test correlates
relatively low with others, the chances are that its beta will go higher than
original estimates, and, conversely, if it correlates relatively high with
other tests, its beta will prove to be lower.

¹ Thorndike, op. cit.


COMBINATIONS OF MEASURES


The regression equation is a means of combining different measures of
the same object in order to derive a composite measure or score. The
scores are summed, each weighted by its regression coefficient. There
are other ways of combining scores to form a composite. For example,
one might simply sum the raw scores for each person without applying
differential weights. This is the common practice in deriving total scores
of tests composed of subtests of different kinds, though in some cases
there is some effort at weighting, e.g., multiplying one score by 2, another
by 3, and so on. Actually, every test that is composed of items may be
regarded as a battery of as many tests as there are items. The total score
is usually an unweighted summation of the item scores, though in many
interest and temperament tests there may be differential weighting.
Rarely does a test maker resort to the determination of regression weights
for test items, but the same principle that applies to test batteries could be
adapted to single tests composed of parts. More often than not, even in
the case of test batteries, there are so many parts, or they are used to
predict in such a variety of situations, that there is not sufficient incentive
to work out the regression weights.

Because there must be substitute weighting procedures in combining
tests, it is important to know some of the better substitute procedures for
the multiple-regression equation and to be able to evaluate the effective-
ness of a composite derived by any method. The multiple R applies only
when the optimal regression weights are used; other weights will yield a
composite that is likely to correlate less with the criterion. There are
other problems connected with composite scores that call for attention,
including what mean and what standard deviation will result when
measures are combined each with a certain weight. These problems will
be dealt with in following paragraphs.

Means of Weighted Composites.—When several measures of the same
object are summed, each with its own weight, the mean of the same kind
of composite for a sample of objects is given by the equation¹

$$M_{ws} = \Sigma w_i M_i \tag{16.16}$$

(Mean of a sum of weighted measures)

where $w_i$ = weight applied to each variable $X_i$, when i varies from 1 to
n in a list of n variables.
$M_i$ = mean for the same sample of objects in variable $X_i$.

If we apply this to the b weights computed for the regression equation
in the prediction of freshmen average grades (see p. 446), the solution
would be

$$M_{ws} = (.182)(19.7) + (.198)(49.5) + (.142)(61.1) + (.395)(29.7) = 33.8$$

Thus, the mean of the composite of four variables, including $X_2$ (arithmetic
test), $X_3$ (analogies test), $X_4$ (high-school average), and $X_5$ (interest score),
weighted with the coefficients .182, .198, .142, and .395, respectively, would
be 33.8. This value is 40.0 units short of the mean for the criterion (fresh-
man grades). By adding the difference (40.0), which is the a coefficient
of the complete regression equation, we obtain a composite mean that
coincides with that of the criterion. This discussion, in other words,
explains the need for the a coefficient in the complete regression equation.
If we were not interested in achieving that mean, we could drop the con-
stant 40.0 and be left with a mean of 33.8.
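Formula (16.16) is a one-line computation, sketched here with the weights and means just quoted. The function name `composite_mean` is hypothetical.

```python
def composite_mean(weights, means):
    """Mean of a weighted sum: the sum of each weight times its variable's mean."""
    return sum(w * m for w, m in zip(weights, means))

weights = [0.182, 0.198, 0.142, 0.395]   # b weights for X2, X3, X4, X5
means = [19.7, 49.5, 61.1, 29.7]         # means of X2, X3, X4, X5

mean_without_constant = composite_mean(weights, means)
mean_with_constant = mean_without_constant + 40.0   # adding the a coefficient

print(round(mean_without_constant, 1), round(mean_with_constant, 1))
```

The second value shows the role of the constant: with it, the composite mean coincides with the criterion mean of 73.8.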
Standard Deviations of Weighted Composites.—We can likewise estl- 
mate the standard deviation of a composite measure when each component 
has a multiplier or weight. ‘The computation of this oer) 
be clearer, however if we consider the standard deviation of a simp 


unweighted sum first. ; 
The Standard Deviation of Sums When Weights Equal oat 
scores from different tests are summed without applying differentia 


i two 
weights, we may regard the weight for each test to be os i 
Scores saw summed to make the composite, the variance of the comp 


scores is given by the equation" 


1 For proof, see Appendix A. 


454 FUNDAMENTAL STATISTICS IN PSYCHOLOGY AND EDUCATION 


(Variance of a sum of two unweighted (16.17) 
measures) 


where o?, and o*2 = variances of the components. 
ri2 = coefficient of correlation between the two components. 


o, = 07) +072 + 2r 


The expression rj20,02 is known as the covariance of the two components. 
Its relation to correlation can be better shown by relating it to the Pearson 
formula, in which 


Exx 


fi = 
2 No 


If we multiply both sides of this equation by cic. we have 


Exita 


N 


1120102 = 


The parallel between the term at theright and the expression for a variance 
should be obvious. A variance is of the form 2x?\/N or 2x%,/N. A 
covariance is the mean of the cross products of deviations; a variance is a 
mean of the squares of deviations. With this new information as back- 
ground, we may translate equation (16.17) into English by saying that the 
variance of a composite is equal to the sum of variances of the components 
plus twice the covariances of all pairs of those components. This is a 
general principle that is important to remember. 
From equation (16.17) it follows, by taking square roots, that 


$$\sigma_s = \sqrt{\sigma_1^2 + \sigma_2^2 + 2r_{12}\sigma_1\sigma_2} \tag{16.18}$$
(Standard deviation of the sum of two unweighted measures)


A demonstration of how this works out in a particular sample is given
in Table 16.9. Ten scores are given for the same individuals in $X_a$ and
in $X_b$, between which the correlation $r_{ab}$ equals zero. If r = .0, the third
term in formula (16.17) drops out and the variance of the composite is
merely the sum of the variances of the components.

In the illustration in Table 16.9, the variances of the two components
are 4.2 and 6.6, respectively. Their sum is 10.8, which checks with the
mean of the squares found from variable $X_s$. The way in which variances
combine is also demonstrated in Fig. 16.3, which pictures hypothetical
distributions for $X_a$, $X_b$, and their sum $X_s$. The position of the scale for
$X_s$ is determined by the juncture of the lines erected at distances of $1\sigma$
from the means of $X_a$ and $X_b$. The slanted scale of $X_s$ is closer to that
of $X_b$, consistent with the fact that $X_b$ contributes more variance to
it than does $X_a$ and the fact that the composite correlates higher with
$X_b$ than with $X_a$. But these are incidental considerations here. The




TABLE 16.9.—THE VARIANCE AND VARIABILITY OF A COMPOSITE SCORE THAT IS THE
UNWEIGHTED SUM OF TWO UNCORRELATED SCORES
(X_s = X_a + X_b; small x denotes a deviation from the mean)

Individual   X_a    x_a   x_a^2    X_b    x_b   x_b^2    X_s    x_s   x_s^2
A              1     -4     16       6      0      0       7     -4     16
B              3     -2      4       7     +1      1      10     -1      1
C              4     -1      1       4     -2      4       8     -3      9
D              5      0      0      10     +4     16      15     +4     16
E              5      0      0       8     +2      4      13     +2      4
F              5      0      0       0     -6     36       5     -6     36
G              5      0      0       6      0      0      11      0      0
H              6     +1      1       8     +2      4      14     +3      9
I              7     +2      4       5     -1      1      12     +1      1
J              9     +4     16       6      0      0      15     +4     16
Sum           50      0     42      60      0     66     110      0    108
M            5.0           4.2     6.0           6.6    11.0          10.8
sigma                     2.05                  2.57                  3.29


FIG. 16.3.—Illustration of the way in which the standard deviation of an unweighted
sum of two measures is related to the standard deviations of those two scores taken
separately, when the two are uncorrelated. (Axes: score in test A, $X_a$; score in test B, $X_b$.)




important demonstration is that when two variables like $X_a$ and $X_b$ are
uncorrelated, we may regard the standard deviation of their composite
$X_s$ as the hypotenuse of a right triangle of which $\sigma_a$ and $\sigma_b$ are the legs.
The old, familiar Pythagorean theorem thus applies to the summation
of two independent variables.
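The arithmetic of Table 16.9 can be checked directly. The sketch below is only an illustration: it uses the ten pairs of scores from the table, divides by N as the text does, and the function names are chosen here for convenience.

```python
import math

# Scores of the ten individuals in Table 16.9.
x_a = [1, 3, 4, 5, 5, 5, 5, 6, 7, 9]
x_b = [6, 7, 4, 10, 8, 0, 6, 8, 5, 6]
x_s = [a + b for a, b in zip(x_a, x_b)]  # unweighted composite


def mean(values):
    return sum(values) / len(values)


def covariance(u, v):
    """Mean cross product of deviations (divisor N, as in the text)."""
    mu, mv = mean(u), mean(v)
    return sum((p - mu) * (q - mv) for p, q in zip(u, v)) / len(u)


def variance(u):
    """A variance is the covariance of a variable with itself."""
    return covariance(u, u)


var_a, var_b, var_s = variance(x_a), variance(x_b), variance(x_s)
cov_ab = covariance(x_a, x_b)

print(var_a, var_b, cov_ab, var_s)
print(math.sqrt(var_a), math.sqrt(var_b), math.sqrt(var_s))
```

Because the covariance is zero in this sample, the composite variance is simply the sum of the two component variances, and the three standard deviations behave like the sides of a right triangle.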

Relation of $\sigma_s$ to the Standard Error of a Difference.—The similarity
between equation (16.18) and equation (9.31) for the standard error of a
difference will probably have been noticed. The only difference is in the
algebraic sign of the covariance term, $2r_{12}\sigma_1\sigma_2$, which is positive in the case
of $\sigma_s$ and negative in the case of $\sigma_d$. Of course, in the preceding discussion
of $\sigma_s$ we have been applying it to distributions of single observations,
whereas $\sigma_d$ has been applied to distributions of means (mean differences).
The principles are the same, either with means or with single observations.
Had we written the summation equation in the form $X_d = X_a - X_b$,
instead of $X_s = X_a + X_b$, we would have been dealing with differences
instead of sums. On the other hand, in the equation $X_d = X_a - X_b$, we
can say that we actually have a summation of scores, those for $X_a$ having
a weight of +1 and those for $X_b$ a weight of −1.
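A short sketch may make the role of the sign concrete. The standard deviations and correlation are borrowed from the chapter's $X_4$, $X_5$ example purely as illustrative inputs; the function name and the `sign` parameter are assumptions of this sketch.

```python
import math


def variance_of_pair(sd1, sd2, r12, sign=+1):
    """Variance of X1 + X2 (sign=+1) or of X1 - X2 (sign=-1)."""
    return sd1**2 + sd2**2 + sign * 2 * r12 * sd1 * sd2


sd1, sd2, r12 = 19.4, 3.7, 0.345  # illustrative values

var_sum = variance_of_pair(sd1, sd2, r12, +1)
var_diff = variance_of_pair(sd1, sd2, r12, -1)

print(round(math.sqrt(var_sum), 2), round(math.sqrt(var_diff), 2))
```

The two variances differ by exactly four times the covariance, which is all that changing the weight of the second component from +1 to −1 accomplishes.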

Variance of a Composite of More than Two Components.—Equation
(16.17) can be extended to include any number of unweighted components.
For each component there would be its variance, but there would be as
many covariance terms to include as there are pairs of components. With
three components there would be three covariance terms: $2r_{12}\sigma_1\sigma_2$, $2r_{13}\sigma_1\sigma_3$,
and $2r_{23}\sigma_2\sigma_3$. Where there are n components, there are $n(n-1)/2$ pairs
and $n(n-1)/2$ covariances to consider. In terms of a general formula

$$\sigma_s^2 = \Sigma\sigma_i^2 + 2\Sigma r_{ij}\sigma_i\sigma_j \tag{16.19}$$
(Variance of a sum of any number of unweighted components)

where $\sigma_i^2$ = variance of any one component, $X_i$.
$r_{ij}$ = correlation between any component $X_i$ and any other component
with a higher subscript number.
$\sigma_i$ and $\sigma_j$ = standard deviations of the two components correlated.

Variance of a Composite of Weighted Components.—When the components
are weighted differently, the variance of the composite will reflect
the weights. Let us begin with the special case of two components. If
the summation equation is of the form

$$X_{ws} = w_1X_1 + w_2X_2$$

the variance of $X_{ws}$ is given by the equation¹


1 For proof, see Appendix A.




$$\sigma_{ws}^2 = w_1^2\sigma_1^2 + w_2^2\sigma_2^2 + 2r_{12}w_1w_2\sigma_1\sigma_2 \tag{16.20}$$
(Variance of a composite of two weighted components)

where $w_1$ and $w_2$ = weights applied to components $X_1$ and $X_2$, respectively.
As an example of this type of problem, let us use the data on $X_4$ and $X_5$
in Table 16.1. If these two variables are used in a composite to predict
$X_1$, the least-square solution gives b weights of .224 and .491, respectively,
and a multiple R, based upon these weights, of .578. The predicted $X_1$
values based upon the equation $X'_1 = .224X_4 + .491X_5$ would be expected
to have a standard deviation equal to $R_{1.45}$ times $\sigma_1$. This product is
.578 × 9.1, which equals 5.26. Let us see whether formula (16.20) will
lead to the same result. By substituting the appropriate values,

$$\begin{aligned}
\sigma_{ws}^2 &= (.224^2)(19.4^2) + (.491^2)(3.7^2) + 2(.345)(.224)(19.4)(.491)(3.7) \\
&= (.050176)(376.36) + (.241081)(13.69) + (.690)(4.3456)(1.8167) \\
&= 18.8842 + 3.3004 + 5.4473 \\
&= 27.6319
\end{aligned}$$

from which

$$\sigma_{ws} = 5.26$$
This agrees exactly with the expectation.
With weights of +1 for both $X_4$ and $X_5$, application of formula (16.17)
would have given

$$\sigma_s^2 = 19.4^2 + 3.7^2 + 2(.345)(19.4)(3.7) = 439.5782$$

from which

$$\sigma_s = 21.0$$
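The substitution above is easy to repeat by machine. The following sketch assumes only the figures quoted in the text (the two b weights, the two standard deviations, their correlation, R, and $\sigma_1$); the function name is hypothetical.

```python
import math


def weighted_pair_variance(w1, w2, sd1, sd2, r12):
    """Formula (16.20): variance of w1*X1 + w2*X2."""
    return (w1**2) * sd1**2 + (w2**2) * sd2**2 + 2 * r12 * w1 * w2 * sd1 * sd2


sd4, sd5, r45 = 19.4, 3.7, 0.345      # values substituted in the text
sd_criterion, multiple_r = 9.1, 0.578  # sigma_1 and R from the text

sd_weighted = math.sqrt(weighted_pair_variance(0.224, 0.491, sd4, sd5, r45))
sd_unweighted = math.sqrt(weighted_pair_variance(1, 1, sd4, sd5, r45))

print(round(sd_weighted, 2), round(multiple_r * sd_criterion, 2))
print(round(sd_unweighted, 1))
```

Both routes to the standard deviation of the predictions, formula (16.20) and R times $\sigma_1$, round to the same two-decimal value, and unit weights reproduce the unweighted result of formula (16.17).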


Variance of a Composite of Any Number of Weighted Components.—When
there are more than two components, each weighted differently, the
variance of the composite is given by the general formula

$$\sigma_{ws}^2 = \Sigma w_i^2\sigma_i^2 + 2\Sigma r_{ij}w_i\sigma_i w_j\sigma_j \tag{16.21}$$
(Variance of a sum of any number of weighted components)

where $w_i$ = weight assigned to variable $X_i$, where i takes on values 1 to
n − 1 in turn.
$r_{ij}$ = correlation between $X_i$ and any other variable $X_j$ where j
is a subscript greater than i.
$\sigma_i$ and $\sigma_j$ = standard deviations of $X_i$ and $X_j$, respectively.


1 See Appendix A. 




We could apply formula (16.21) to the four components of the regression
equation predicting freshman grades with the appropriate b weight substituted
for w in each case. We should find that the standard deviation
is equal to R times $\sigma_1$, which is .698 × 9.1 = 6.35. The inclusion of
variables $X_2$ and $X_3$ in the regression equation raises the dispersion of the
predicted grades from 5.26, which it would be with $X_4$ and $X_5$ only, to 6.35.
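Formula (16.21) lends itself to a small general routine. The sketch below is one possible arrangement, not a prescribed procedure: the function name and the use of a full correlation matrix are assumptions, and the three-component figures are hypothetical because the intercorrelations of the four predictors are not reproduced in this section.

```python
import math


def composite_variance(weights, sds, corr):
    """Formula (16.21): weighted variances plus twice the weighted covariances.

    corr is a full symmetric matrix (list of lists); only entries with
    j > i are read, matching the 'higher subscript' convention.
    """
    n = len(weights)
    total = sum((weights[i] * sds[i]) ** 2 for i in range(n))
    for i in range(n - 1):
        for j in range(i + 1, n):
            total += 2 * corr[i][j] * weights[i] * sds[i] * weights[j] * sds[j]
    return total


# Two-component check with the X4, X5 figures used in the text.
sds = [19.4, 3.7]
corr = [[1.0, 0.345], [0.345, 1.0]]
sd_two = math.sqrt(composite_variance([0.224, 0.491], sds, corr))
print(round(sd_two, 2))

# Hypothetical three-component case (illustrative values, not from the text).
sds3 = [2.0, 3.0, 4.0]
corr3 = [[1.0, 0.5, 0.0], [0.5, 1.0, 0.25], [0.0, 0.25, 1.0]]
var3 = composite_variance([1, 1, 1], sds3, corr3)
print(var3)
```

With two components the routine reduces to formula (16.20); with all weights equal to 1 it reduces to formula (16.19).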

Achieving Any Desired Standard Deviation in a Composite.—In using
regression equations, the dispersion of the predictions falls short of that
of the obtained values. This is all right and proper when we are interested
in predicting an individual’s most probable measure on the scale of
obtained measures in $X_1$. The regression of predictions toward the
general mean is a natural phenomenon of imperfect correlation, as was
pointed out before (Ch. 15). There may be other uses of composites,
however, that call for other values than those given by the regression
equation. Suppose that we wanted predictions to spread just as much as
the obtained values do. Suppose that we wanted them to be dispersed
with some standard variability, for example, with a $\sigma$ of 10.0, as on a
T scale, or a $\sigma$ of 2.0, as on a C scale. The way that kind of goal can be
achieved will now be explained.


Fortunately, for the solution of this problem, it is not the absolute
sizes of the weights that matter; it is their ratios to one another. So long
as they bear the same relations to each other, the correlation of the composite
with some criterion will remain the same. Consequently, we could
double, triple, or otherwise change the regression weights by some common
multiple, without affecting the predictive value, if all we want is to predict
individuals in the same relative positions in a distribution.

The $\sigma$ of the predictions is always related to the $\sigma$ of the obtained values
by the extent of the correlation (when optimal weights are used). In a
multiple-regression problem, the $\sigma$ of the predicted values equals R times the
$\sigma$ of the obtained values. We can therefore make the $\sigma$ of the predictions
equal the $\sigma$ of the obtained values by dividing each regression coefficient
by R. A readjusted b coefficient, then, would be computed by the formula


$$b'_{12.34 \ldots m} = \beta_{12.34 \ldots m}\,\frac{\sigma_1}{\sigma_2 R_{1.23 \ldots m}} \tag{16.22}$$
(Regression coefficient adjusted to make the $\sigma$ of a composite equal $\sigma_1$)

If the $\sigma$ desired in the composite is 10, or 2, or any other chosen quantity,
it can be achieved by substituting that quantity for $\sigma_1$ in the formula.
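Since only the ratios of the weights matter, any desired dispersion can be reached by multiplying every weight by one constant. The sketch below illustrates this with the two-predictor example; the helper names are hypothetical, and the current standard deviation is computed from formula (16.20) rather than taken as R times $\sigma_1$, so small rounding differences from the text are to be expected.

```python
import math


def pair_sd(w1, w2, sd1, sd2, r12):
    """Standard deviation of w1*X1 + w2*X2 (square root of formula 16.20)."""
    var = (w1 * sd1) ** 2 + (w2 * sd2) ** 2 + 2 * r12 * w1 * w2 * sd1 * sd2
    return math.sqrt(var)


def rescale_weights(weights, current_sd, desired_sd):
    """Multiply every weight by one constant so the composite has desired_sd."""
    factor = desired_sd / current_sd
    return [w * factor for w in weights]


b = [0.224, 0.491]                  # b weights for X4 and X5 from the text
sd4, sd5, r45 = 19.4, 3.7, 0.345
current = pair_sd(b[0], b[1], sd4, sd5, r45)   # about 5.26

t_weights = rescale_weights(b, current, 10.0)  # dispersion of a T scale
t_sd = pair_sd(t_weights[0], t_weights[1], sd4, sd5, r45)

print([round(w, 3) for w in t_weights], round(t_sd, 2))
```

The rescaled weights keep the same ratio to one another, so the correlation of the composite with the criterion is unchanged while its standard deviation becomes 10.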


Achieving Any Desired Mean for a Composite.—In the complete regres- 
sion equation, in order to make the mean of the predictions equal that of 
the obtained values, the a coefficient is introduced. The computation of 




a is given by formula (16.15). After one has determined any weights 
whatever to apply to the raw scores of the components of a composite 
measure, the same formula can be applied, putting in the place of $M_1$ any
desired quantity. This is true because of the reasoning involved in the
computation of the mean of a composite (see formula 16.16). Thus, if 
we had wanted the mean of the grades predicted by the regression equation 
on p. 447 to be 50, we would have substituted 50 instead of 73.8, the actual 
mean of the grades. The only practical restriction would be to choose a 
mean such that no composite measures would be negative. This means 
that any chosen mean should be at least 2.5 to 3.0 times the standard 


deviation of the composite. 
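As a rough illustration of choosing the constant, the sketch below computes the a coefficient for the actual criterion mean and for a desired mean of 50. It assumes the b weights quoted earlier in the chapter, takes the component means from Table 16.10 (with its columns read as $X_2$ to $X_5$), and uses the composite standard deviation of 6.35 given in the text; the function name is hypothetical.

```python
def a_for_desired_mean(desired_mean, weights, component_means):
    """Constant to add so that the weighted composite has desired_mean."""
    return desired_mean - sum(w * m for w, m in zip(weights, component_means))


b_weights = [0.182, 0.198, 0.142, 0.395]     # b weights quoted in the text
component_means = [19.7, 49.5, 61.1, 29.7]   # means listed in Table 16.10
composite_sd = 6.35                          # R times sigma_1 from the text

a_actual = a_for_desired_mean(73.8, b_weights, component_means)
a_fifty = a_for_desired_mean(50.0, b_weights, component_means)

# Rule of thumb from the text: the mean should be at least 2.5 to 3.0
# composite standard deviations, so that no composite is negative.
mean_is_safe = 50.0 >= 3.0 * composite_sd

print(round(a_actual, 1), round(a_fifty, 1), mean_is_safe)
```

Only the constant changes when a different mean is wanted; the weights, and therefore the correlation with the criterion, stay as they were.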


... situation to deviate from the refined solution. We can
substitute weights that approximate the regression coefficients,
very roughly at times, and still not affect the
degree of correlation very much. Instead of applying weights to three
decimal places, one significant digit will often suffice, in other words,
simple integral weights. In predicting freshman grades from high-school
average and interest score combined, for example, we found the optimal
weights to be .224 and .491. We might in practice round these to .2 and
.5, respectively. It will be shown later¹ that the change in correlation
between $X_1$ and the composite in the two cases is from .578, with the three-digit
weights, to .577, with the one-digit weights. Surely, this loss is quite
trivial. We could use weights of 2 and 5 had we so chosen. Suppose we
want even a simpler ratio of the two weights, like 1/2, rather than 2/5.
With weights of 1 and 2, also, the correlation of composites and grades
would be .577. With equal weights the correlation would drop to .570.
Even this much loss could be tolerated.

Before the reader draws the conclusion from this isolated example that
all differential weighting is unnecessary, however (many generalizations,
unfortunately, are just as sweeping as this would be), it is necessary to
consider some points not yet brought out. There is no reason to believe
that this is a typical example. Ordinarily, the more independent variables
in a composite, the more can one depart from the weights demanded by
least-square solutions and yet maintain a high level of correlation between
that composite and a criterion to which the weights apply. This is why
with a test of many items we may forget to bother with differential weighting.
In a two-variable composite, however, we have the minimum number.
We would therefore expect to find the validity of the composite to
be rather sensitive to changes in weights. Roughly, the explanation in
this example is that $X_4$ (high-school average) has a beta weight about 2.4
times that for $X_5$ (interest score) and it has a standard deviation about 5
times as large as that for $X_5$. Even when $X_4$ and $X_5$ have the same weight
in the composite, $X_4$ contributes to the composite in proportion to its
standard deviation. This follows from equation (16.17) in which it is
shown that without differential weights each part’s contribution to total
variance is proportional to its own variance. Without differential weighting
factors in the equation, then, $X_4$ is still weighted much more than $X_5$.
This illustrates a fact that is not often realized. It is usually assumed that
merely summing several scores weights those scores equally. As a rule,
it does not; it weights them in proportion to their standard deviations. In
more common-sense language, tests weight themselves.

1 Methods for correlating composites or sums, either weighted or unweighted, will be
described beginning on p. 462.
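One way to see that tests weight themselves is to correlate an unweighted sum with each of its own parts. The sketch below uses the standard deviations and intercorrelation quoted for $X_4$ and $X_5$; the formula for the correlation of a sum with a component follows from the covariance of $X_1$ with $X_1 + X_2$, and the function name is an assumption of this sketch.

```python
import math


def sum_component_correlations(sd1, sd2, r12):
    """Correlation of the unweighted sum X1 + X2 with each of its parts."""
    sd_sum = math.sqrt(sd1**2 + sd2**2 + 2 * r12 * sd1 * sd2)
    # cov(X1, X1 + X2) = sd1^2 + r12*sd1*sd2; divide by sd1 * sd_sum.
    r_sum_1 = (sd1 + r12 * sd2) / sd_sum
    r_sum_2 = (sd2 + r12 * sd1) / sd_sum
    return r_sum_1, r_sum_2


r_s4, r_s5 = sum_component_correlations(19.4, 3.7, 0.345)
print(round(r_s4, 3), round(r_s5, 3))
```

With nominally equal weights the sum is almost the same thing as the high-school average (a correlation near .99) and relates only moderately to the interest score (near .50), which is the sense in which the more variable test dominates.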


of the tests and of their intercorrel 
to do. It is sometimes done. 


grades. The means and standa 
in Table 16.1. 


TABLE 16.10.—THE PROCESS OF WEIGHTING COMPONENTS INVERSELY AS THEIR
DISPERSIONS

                                        Variables
                                 X_2     X_3     X_4     X_5
M                               19.7    49.5    61.1    29.7
sigma                            5.2    17.0    19.4     3.7
19.4/sigma (w')                 3.73    1.14    1.00    5.24
Integral weight (W)                4       1       1       5
Estimated importance (I)           2       2       5       1
Combined weight (Iw')           7.46    2.28    5.00    5.24
Revised integral weight (W')       7       2       5       5
Simplified weight                  3       1       2       2


We could find a weight equal to $1/\sigma$ for every test, but these weights
would be rather small decimal numbers in some cases. A good practical
procedure is to select the largest $\sigma$ in the list, in this case 19.4, and to compute
the ratio $19.4/\sigma$ for each test. The test with the largest $\sigma$ will have
the smallest weight. With this particular ratio, the smallest weight will
then be exactly 1.0. The ratio of any other weights to this one will be
immediately apparent. It is recommended that all these ratios be
rounded to the nearest integer, as shown in the fourth row of Table 16.10.
The weights obtained by this process are 4, 1, 1, and 5, respectively.
With these weights applied, all four tests would contribute approximately
the same amount of variance to the total variance.
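The procedure just described can be sketched in a few lines. The standard deviations are those of Table 16.10; the labels X2 to X5 for its four columns are an assumption of this sketch, as are the variable names.

```python
sds = {"X2": 5.2, "X3": 17.0, "X4": 19.4, "X5": 3.7}  # from Table 16.10

largest = max(sds.values())
ratio_weights = {name: largest / sd for name, sd in sds.items()}           # w'
integral_weights = {name: round(w) for name, w in ratio_weights.items()}  # W

# Standard deviation each test contributes once its integral weight is applied.
weighted_sds = {name: integral_weights[name] * sd for name, sd in sds.items()}

for name in sds:
    print(name, round(ratio_weights[name], 2), integral_weights[name],
          round(weighted_sds[name], 1))
```

After weighting, the four effective standard deviations all fall between 17 and 21, which is the sense in which the tests now contribute approximately equally.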

The principle of weighting each test inversely as its dispersion is involved
in the b coefficient. Remember that b is equal to beta times $\sigma_1/\sigma_i$, where
$\sigma_i$ is the standard deviation of the test to be weighted. Using this procedure,
therefore, is virtually equivalent to using an incomplete b coefficient.
It virtually assumes equal validities for all tests and equal intercorrelations,
conditions which would lead to equal betas.

From the solution in Table 16.10, measures $X_4$ and $X_5$ should receive
weights of 1 and 5, respectively. The difference is in the same direction
as for the two b weights, which are .224 and .491, respectively, but $X_4$ is
given relatively about half as much importance as it should have. The
effect upon the correlation of the composite, weighted this way, is to reduce
it from the optimal R of .578 to a correlation of .558. The underweighting
of $X_4$, which is more valid and has a larger beta than $X_5$, shows up in the
lower validity of this composite.

Other Principles of Weighting.—Common sense may suggest that component
tests should be weighted in proportion to their lengths or their
means or other obvious properties. To do so may lead the uninformed
investigator astray. If two tests of unequal length are equally effective,
in the sense that they produce dispersions in proportion to their lengths,
when no weights are applied at all they are automatically weighted in
proportion to their lengths. Attaching more weight to the long test thus
merely exaggerates an effect we already have. There is no real justification
for weighting tests in proportion to their means, and, when means are
proportional to standard deviations, the policy would again carry the
weighting further in the same direction.

If parts are regarded as really of equal importance, then a correction
such as was described above would be in order. If the traits measured by
different tests are regarded as differing in importance, and if we can decide
upon ratios of importance, we can combine weights based upon these ratios
with whatever weights we already have. Suppose, for example, we
thought that the four variables in Table 16.10 are important in the ratios
2, 2, 5, and 1. Two weights for a variable are combined by finding their
product.
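The combined-weight rows of Table 16.10 can be reproduced as follows. This is a sketch under the assumption, suggested by the table itself (7.46 = 2 × 3.73), that the importance ratio and the dispersion weight are multiplied; the variable names are hypothetical.

```python
ratio_weights = [3.73, 1.14, 1.00, 5.24]  # w' row of Table 16.10
importance = [2, 2, 5, 1]                 # judged ratios of importance

combined = [i * w for i, w in zip(importance, ratio_weights)]
revised_integral = [round(c) for c in combined]

print([round(c, 2) for c in combined])  # compare with the combined-weight row
print(revised_integral)                 # compare with the revised integral row
```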




Thus, crude, integral weights of 2 and 5 would give as high a correlation
of the combination of $X_4$ and $X_5$ with $X_1$ (freshman grades) as would the
three-digit b coefficients .224 and .491.

For the general case, with more than two components, the correlation 
with an outside variable is 


$$r_{c\,ws} = \frac{\Sigma w_i r_{ci}\sigma_i}{\sqrt{\Sigma w_i^2\sigma_i^2 + 2\Sigma r_{ij}w_iw_j\sigma_i\sigma_j}} \tag{16.26}$$
(Correlation of a weighted sum with an outside variable)


where the symbols are as defined in preceding formulas. 
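The effect of substituting rough weights can be traced with formula (16.26). In the sketch below the validity coefficients $r_{14}$ and $r_{15}$ are not read from Table 16.1, which lies outside this section; as an explicit assumption they are recovered from the quoted b weights through the normal equations. The function and variable names are hypothetical.

```python
import math


def weighted_sum_correlation(weights, validities, sds, corr):
    """Formula (16.26): correlation of a weighted sum with an outside variable."""
    n = len(weights)
    numerator = sum(weights[i] * validities[i] * sds[i] for i in range(n))
    variance = sum((weights[i] * sds[i]) ** 2 for i in range(n))
    for i in range(n - 1):
        for j in range(i + 1, n):
            variance += 2 * corr[i][j] * weights[i] * weights[j] * sds[i] * sds[j]
    return numerator / math.sqrt(variance)


sd1, sds = 9.1, [19.4, 3.7]            # criterion, X4, X5
corr = [[1.0, 0.345], [0.345, 1.0]]
b = [0.224, 0.491]

# Assumption: recover r14 and r15 from the b weights via the normal
# equations (beta_j = b_j * sd_j / sd1; r_1j = sum over k of beta_k * r_jk).
betas = [b[j] * sds[j] / sd1 for j in range(2)]
validities = [sum(betas[k] * corr[j][k] for k in range(2)) for j in range(2)]

weight_sets = [(0.224, 0.491), (2, 5), (1, 2), (1, 1), (1, 5)]
results = {w: weighted_sum_correlation(list(w), validities, sds, corr)
           for w in weight_sets}

for w, r in results.items():
    print(w, round(r, 3))
```

Under that assumption the routine reproduces the correlations cited in this chapter for the optimal weights, for the rounded weights 2 and 5 or 1 and 2, for equal weights, and for the weights 1 and 5 obtained from Table 16.10.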


ALTERNATIVE SUMMARIZING METHODS 


Summative equations represent only one way in which several measures 
may be combined in order to reach single predictions or decisions. There 
are alternative methods, some of which are better than regression equations
in certain situations. The two chief contenders are the multiple cutoff
method and the profile method. These will be described and their variations
discussed.


... military service, for life insurance, or for employment. Failure to meet
the standard on any one test
may disqualify the individual. Making a particularly good showing in
one respect is not ordinarily allowed to compensate for a poor showing in
some other. The phenomenon of compensation, which the regression-equation
approach allows, is the chief difference between the two methods,
in principle.

Multiple Cutoffs Contrasted with Multiple Regression.—A geometric illustration
of the difference between the two methods may be seen in Fig. 16.4.
The two variables represented ($X_2$ and $X_3$) are both independent
variables used to predict $X_1$, which is not shown. Assume that selection
is made either by applying a separate cutoff score on each variable or by
applying a single cutoff score
based upon a weighted sum of $X_2$ and $X_3$. Assume also that we reject the
same proportion of the applicants by either method.
The use of two cutoff scores would reject all individuals to the left of
the point $X_2$, and a vertical line erected at that point, also all individuals
below the point $X_3$, and a horizontal line drawn at that level. Some individuals
would be rejected on the basis of either variable alone and some on
the basis of failure to meet standards on both. The single cutoff on the
weighted composite, however, would be represented by a slanted line.
This is consistent with the slanted-line system shown in Fig. 16.2. All
individuals below and to the left of this slanted line would be rejected.
It is now possible to see what kind of individuals would be accepted by
the one method and rejected by the other and on which ones the two
methods agree. The individuals in area A of the ellipse would be accepted
by either method. The individuals in area R would be rejected by either
method. Individuals in area B would have been rejected by the
multiple-regression-equation method but would be accepted by the multiple-cutoff
method. Individuals in areas C and D would be accepted by the regression
method but rejected by the cutoff method, those in C for different
reasons than those in D.

FIG. 16.4.—Geometric comparison of accepted and rejected personnel by the multiple-
regression-equation method and by the multiple-cutoff method, when approximately
the same proportions are selected by either method. (After R. L. Thorndike, AAF Report
No. 3.)
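The sorting of applicants into the areas of Fig. 16.4 can be sketched as a small decision routine. Everything numerical below (cutoffs, weights, composite cutoff, applicants) is hypothetical and chosen only to produce one case of each kind; areas C and D are not distinguished because the figure itself is not reproduced here.

```python
def classify(x2, x3, cut2, cut3, w2, w3, composite_cut):
    """Return the area label used in the discussion of Fig. 16.4.

    A: accepted by both methods      R: rejected by both methods
    B: passes both cutoffs but falls below the composite cutoff
    C/D: fails one cutoff but the composite score is high enough
    """
    passes_cutoffs = x2 >= cut2 and x3 >= cut3
    passes_composite = w2 * x2 + w3 * x3 >= composite_cut
    if passes_cutoffs and passes_composite:
        return "A"
    if not passes_cutoffs and not passes_composite:
        return "R"
    return "B" if passes_cutoffs else "C/D"


# Hypothetical cutoffs, weights, and applicants (none of these come from the text).
settings = dict(cut2=40, cut3=40, w2=1.0, w3=1.0, composite_cut=100)
applicants = {"P": (60, 55), "Q": (42, 43), "S": (35, 80), "T": (30, 30)}

labels = {name: classify(x2, x3, **settings)
          for name, (x2, x3) in applicants.items()}
print(labels)
```

Applicant Q barely clears both hurdles yet has a low total, while applicant S fails one hurdle but compensates with a strong score on the other; these are the two kinds of disagreement between the methods.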

The crux of the comparison of values of the two methods lies in determining
whether individuals in area B are any better in the criterion than
those in areas C and D. Individuals in area B are rejected by the one
method because they combine below-average scores in $X_2$ and $X_3$. They
just succeed in meeting minimum standards in both variables and so
would be accepted by the other method. Individuals in areas C and D,
although below standards in one variable, are allowed to present compensating
strong scores in the other variable and hence to be accepted by
the one method. They are regarded as doubtful risks by the other method.

It can be argued that not enough is known about compensatory effects 
in performances that serve as criteria, and that is quite true. There 
should be some experimental studies of this kind. A vindication of the 
regression method, however, is found in the consistency with which com- 
posite scores continue to correlate as they do in line with multiple-correla- 
tion coefficients that forecast those correlations. If compensatory effects 
did not occur, there would probably be much more shrinkage in correlation 
of sums with criteria than there is. 

An Evaluation of the Multiple-Cutoff Method.—If all regressions are
linear, theoretically, there should be no advantage in selection by multiple
cutoffs over that by composites. This can be explained roughly by the
fact that in a linear regression there is a continuous improvement in criterion
measures with increased score in an independent variable, and at a
constant rate. Thus, so far as the relationship between the test and the
criterion is concerned, there is no more reason for putting the cutoff at one
point rather than another. The cutoff would have to be established on
the basis of some other determiners, such as success ratio or validity. In
using a number of tests for selection for a single purpose, presumably it
would be best to make the most rejections on the basis of the most valid
test. When a regression is definitely curved, there is a real basis for using
a cutoff on a single test. The cutoff would be established in line with the
region of transition between low and high rates of increase in the criterion
measure. For example, in Fig. 15.12, somewhere between the scores 90
and 100 would be a good division point, taking advantage of the rapid
increase in criterion values as scores increase in X, and at the same time
recognizing that above a score of 100 there are no appreciable differences
in criterion values as X changes.


There are some practical difficulties in the administration of multiple
cutoffs which make the method less appealing than a regression equation.
There is the difficulty of establishing several different cutoff points which
will take full advantage of the differences in validity among the tests and
which will yield the appropriate numbers of qualified applicants. Once
the minimum standards are established, however, the method is simple to
apply. Failure to meet any one of the minimal scores automatically
means rejection.


Rejection of an applicant on the basis of a single test is somewhat risky 
as compared with rejection on the basis of a composite score because of the 
fact that the reliability of a single test score is usually less than that for a 




composite. If the parts of a composite are positively intercorrelated, the 
total score is more reliable than the part scores. 

Some Variations of the Multiple-cutoff Method.—A distinction is made
between a simultaneous-hurdles method and a successive-hurdles procedure
in testing programs using multiple cutoffs.¹ In the former, all applicants
take all tests; in the latter they do not—they continue to take tests only as 
long as they continue to qualify on them. After the first failure they are 
rejected. In the latter method it is good practice to administer the most 
valid test first. It is the one on which the largest number of rejections 
should be made. It is desirable, too, that if a single attempt is to be 
decisive for so many individuals, the decision should be made on as good 
a basis as possible. If a test of very low validity were given first, some 
who could qualify on the valid test would never have a chance to take it. 
Such individuals might be expected to fail when they took the invalid test 
later, of course, but remember that tests are not perfectly reliable, and a 
person might pass a certain test on one day and fail it on another. The 
successive-hurdles method has the great practical advantage of saving in 
testing time. If there are many more applicants than openings, large 
numbers of applicants can be screened and eliminated from further testing 
by means of a single preliminary examination. 
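The saving in testing time can be illustrated with a toy comparison of the two procedures. The scores and cutoffs below are hypothetical, and the function names are assumptions of this sketch; tests are listed with the most valid first, as the text recommends.

```python
def successive_hurdles(scores, cutoffs):
    """Administer tests in order and stop at the first failure.

    Returns (accepted, number_of_tests_taken).
    """
    taken = 0
    for score, cutoff in zip(scores, cutoffs):
        taken += 1
        if score < cutoff:
            return False, taken
    return True, taken


def simultaneous_hurdles(scores, cutoffs):
    """Every applicant takes every test; any failure rejects."""
    accepted = all(s >= c for s, c in zip(scores, cutoffs))
    return accepted, len(scores)


# Hypothetical applicants and cutoffs; tests are listed most valid first.
cutoffs = [50, 40, 45]
applicants = [[62, 48, 51], [44, 70, 66], [55, 35, 60], [58, 41, 30]]

succ = [successive_hurdles(a, cutoffs) for a in applicants]
simul = [simultaneous_hurdles(a, cutoffs) for a in applicants]
tests_saved = sum(t for _, t in simul) - sum(t for _, t in succ)

print(succ, simul, tests_saved)
```

With fixed scores the two procedures reach the same decisions; the successive version simply administers fewer tests. The sketch does not model the unreliability of scores that the text warns about.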

Other variations in using the multiple-cutoff principle have to do with 
rules concerning rejection. It is not necessary to base a rejection on one 
test alone. The rules might allow for failure on not more than two, or
any selected number of variables. The rules might be refined to the extent
of considering pairs or triads of tests. Rejection might be reserved for
those who fail on test M only if they also fail on test V, and so on. Such
refinement, however, must be based upon good evidence that it pays in
terms of better selection. For most purposes such evidence is lacking. 

Profile Methods.—For guidance work and clinical work in general there 
is common preference for seeing an individual’s scores represented in a 
pattern provided in a profile. A single summative score is unsuitable or 
may be unobtainable. A single composite score is unsuitable perhaps 
because the problem is not a selection problem but a classification problem. 
In vocational guidance, clients are “sorted” into vocational categories. 
If there were single summative scores already established with satisfactory
correlations with vocational criteria of many kinds, perhaps the profile
method would be less important. Clinicians commonly express a desire
to “see a personality in its totality,” however, and a profile is one approach
to this end.

1 Toops, H. A. Philosophy and practice of personnel selection. Educ. & Psychol.
Meas., 1945, 5, 95-124.




There are several ways of using profiles. Some prefer the intuition
given by a general impression of a plotted graph for an individual. Others
prefer to match more definitely described job-requirement or adjustment-requirement
patterns with individual trait patterns. It is possible, by
means of careful research, to define certain adjustment requirements in
terms of optimal scores in a number of different variables. This statement
implies curved regressions, and that is precisely the condition which favors
the choice of a profile method to a regression method.


FIG. 16.5.—An illustration of the profile method of selection applied to personality-
inventory scores. The clear portion of the chart represents what is believed on the basis
of experience to be the most favorable score ranges for personnel who are assigned to a
certain routine type of work. The scores of the worker shown all fell within the favorable
region. (Vertical scale: C score.) (Courtesy of R. P. Kreuter, Hand Knit Hosiery Company,
Sheboygan, Wisconsin.)


Figure 16.5 demonstrates thi By experience, 
kers in a certain kind of routine task tended 


depressed and emotional side, | 
tary), less ascendant socially, somewhat beset w 
somewhat subjective or hypersensitive, and perhaps none too agreeable or 
cooperative. In most respects the tendencies listed would seem to picture 
a generally “poor” personality picture. Low extremes were unfavorable, 
however; the general tendency was just average or slightly below in most 




traits. This is understandable in that such an individual is probably 
lacking in aspirations for positions that require the better qualities and is 
contented with a routine type of work in which adjustments to social 
requirements are relatively easy. The profile is shown of a certain indi- 
vidual who was rated very high in performance at her task. 

For selection purposes, a profile may be handled in various ways. The 
one shown in Fig. 16.5 illustrates one procedure. The favorable zone is 
clear, and less favorable zones are crosshatched. The crosshatching can 
be overprinted on the chart or a plastic mask can be prepared to lay over 
individual charts. Decisions can be based upon the number of favorable 
scores or upon the trend of the individual’s curve as compared with the 
trend of the optimal scores. If a single optimal score has been determined
for every trait, and an “ideal” profile has been drawn, the departure of a 
single profile from the ideal profile can be determined in various ways, none 
of them highly satisfactory. The deviations of each person’s scores from 
the ideal scores can be summarized in various ways. A way that meets 
common statistical principles would be to square the deviation, sum the 
squares, find a mean, and then a square root. This would give a single 
summarizing statistic that has some statistical sanction. There are 
many who would want more than such a number, however, for it does not 
tell us where the deviations are. The general problem of using profiles in 
a rigorous manner is still unsolved. 
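The summarizing statistic described above (square the deviations, sum, find the mean, take the square root) can be sketched as follows. The ideal profile and the individual's scores are hypothetical C-scale values, and the function names are assumptions of this sketch.

```python
import math


def profile_distance(scores, ideal):
    """Square the deviations, sum, take the mean, then the square root."""
    squared = [(s - i) ** 2 for s, i in zip(scores, ideal)]
    return math.sqrt(sum(squared) / len(squared))


def deviations(scores, ideal):
    """Signed deviations, which show where a profile departs from the ideal."""
    return [s - i for s, i in zip(scores, ideal)]


# Hypothetical C-scale scores on five traits (illustrative values only).
ideal = [4, 5, 4, 5, 4]
person = [4, 7, 3, 5, 2]

print(round(profile_distance(person, ideal), 2), deviations(person, ideal))
```

The single number is convenient for ranking individuals, but, as the text notes, only the list of signed deviations shows in which traits the departures occur.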

Classification of Personnel.—Selection of personnel presupposes a
supply of applicants and the possibility of rejecting a proportion of them. 
Attention is upon one kind of assignment to be filled. In the classifica- 
tion of personnel, there are two or more assignments that can be made 
and one might even consider rejecting none, provided proper assignments 
can be found for all. In some situations there is the double problem of 
selection and classification combined. The availability of more than one 
assignment, however, makes possible the utilization of many more appli- 
cants than would be true if there were only one kind of place to fill, for, 
presumably, personnel who do not qualify for one place might well qualify 
for some other. The more different kinds of places there are to fill, the 
smaller the chance of any applicant’s being rejected for every kind.

Classification, broadly defined, means assigning individuals each to his
appropriate category. This would include the operations in educational
and vocational guidance. In vocational guidance, the number of kinds
of “assignments” is almost infinite, though the number of major categories
is limited. In selection we have an assignment with the need to find the
person for it; in classification in general, we have a number of assignments
with their requirements in terms of human resources, on the one hand, and
a number of persons who have the resources to satisfy or not to satisfy
each assignment on the other. In vocational guidance, we have one individual,
with a unique pattern of resources, on the one hand, and a large
variety of possible occupations or assignments, on the other.
As demonstrated in this and in preceding chapters, we have solved
many of the statistical problems involved in selection. These are bound
up with the problems of prediction and of how to evaluate the goodness of
a prediction. By contrast, the problems of classification have been
seriously neglected. Assignment to alternative classes requires a differential
prediction rather than a prediction of a single variable. We have to
predict how much better the individual will adjust or perform if assigned
to one category than if assigned to some other category. There are no
regression equations as yet devised for this particular purpose, either from
a single pair of single variables or from two composites. Presumably,
when only two assignments are being considered and two predictive
indices, we attempt to predict a difference in the criterion variable from a
difference in the assessment variable. It is reasonable that the more independence
there is between two criterion variables the more easily one could
make a differential prediction and the more confidently would one expect
to find relatively independent assessment variables. Lack of correlation
between assessment variables would seem to be just as important as independence
between criterion variables. This is probably not the whole
story, as a preliminary study of this problem by R. L. Thorndike has
shown.¹


The problem does not seem SO ve: 
categories of classification are invol 


sent one answer. This would mea: 
for different assignments, 
mate solution. From a la 


¹ Thorndike, R. L. Research problems and techniques. AAF aviation psychology
research program, Report No. 3. Washington, D.C.: Government Printing Office,
1947, 125.




Exercises 


1. Using Data 16A, compute a regression equation involving $X_1$, $X_2$, and $X_4$.
Give beta coefficients, multiple R, and other necessary statistics. Interpret your
results.

2. Using Data 16B, compute a regression equation involving Xi, Xs, and Xs. 
Present all statistics as computed in an ordinary solution to a multiple-correlation
problem. Interpret your results. 

3. Give Data 16A a complete solution, using the Doolittle method. Include a
regression equation and your interpretations. 

4. Do the same for Data 16B as was called for in Exercise 3. 

5. Find the best combination of three predictive indices for either Data 16A or 
Data 16B. 

6. For Data 16A or Data 16B, assume five reasonable sets of scores for five hypothetical
individuals in the independent variables for which you have solved the regression
equation, and from them predict $X_1$ values.

7. Compute standard errors of multiple estimate, coefficients of multiple determination
and nondetermination, and indices of forecasting efficiency for the problem in
Exercises 1 and 3. Interpret your results.

8. Determine how much of the total variance in the dependent variable is accounted
for in the composites in Exercises 2 and 4. Interpret your results.

9. Compute standard errors of the regression coefficients in Exercises 1 and 2.
Draw conclusions.

10. Apply shrinkage formulas to the results in Exercises 3 and 4, both in the multiple
R’s and in the standard errors of estimate. State conclusions.
DATA 16A.—INTERCORRELATIONS OF SCORES FROM FOUR EXAMINATIONS AND MARKS
RECEIVED IN FRESHMAN MATHEMATICS
(N = 100)

Variable    X2     X3     X4     X5     X1
X2           —    .70    .53    .39    .51
X3          .70    —     .61    .29    .51
X4          .53   .61     —     .28    .61
X5          .39   .29    .28     —     .39
X1          .51   .51    .61    .39     —

Mx         4.10   5.44   5.37   4.95   5.70

σx: 1.84, 2.26, 2.14, 2.42

X2 = Ohio State psychological examination.
X3 = English-usage examination.
X4 = algebra examination.
X5 = engineering aptitude examination.
X1 = marks in freshman mathematics.

11. By the iterative method, solve for the optimal beta weights for either Data 16A
or 16B. Compare them with those found by the Doolittle method.




DATA 16B.—INTERCORRELATIONS OF SCORES FROM FOUR EXAMINATIONS AND MARKS
RECEIVED IN ENGINEERING DRAWING
(N = 154)

Variable    X2     X3     X4     X5     X1
X2           —    .53    .24    .28    .33
X3          .53    —     .24    [?]    .34
X4          .24   .24     —     .38    .31
X5          .28   [?]    .38     —     .41
X1          .33   .34    .31    .41     —

Mx         4.19   5.42  [?].70  4.85   5.25
σx         2.04   2.32   1.93   2.05   1.45

X2 = Ohio State psychological examination.
X3 = algebra examination.
X4 = paper-folding test.
X5 = form-perception test.
X1 = term mark in engineering drawing.


Do the same 


exercises 3 and 4, also omitting the 
constant a, 


13. Estimate the standard deviation of an unweighted sum of scores $X_2$ and $X_4$ in
Data 16A. Estimate the standard deviation of a weighted combination of scores in
$X_2$ and $X_4$ in the same data, using the optimal weights, in two ways. Use weights of
2 and 5 for the same two variables, respectively, and estimate the standard deviation of
such a composite.

14. Find the correlation of an unweighted combination of $X_2$ and $X_4$ with $X_1$, also
for a combination with weights of 2 and 5, respectively. Compare these with the
multiple $R_{1.24}$.


CHAPTER 17 
RELIABILITY OF MEASUREMENTS 


The Importance of Reliability—Much of what was said in previous 
chapters assumed that measurements were perfectly reliable, or nearly so. 
By a perfectly reliable measurement we mean one that is completely stable 
or fixed. The same “yardstick” applied to the same individual or object 
should yield the same value from moment to moment, provided the thing 
measured has itself not changed in the meantime. An unreliable yard- 
stick is a “rubbery” yardstick. 


There are times, both in theoretical investigations and in practical work,
when it is very important to take into account the question of reliability.
Although numbers, as such, are exact descriptions, just because we amass
a series of numbers attached to individuals or to observations is no assurance
that those numbers mean much at all about the things measured.
There is no way of just looking at numbers and telling whether or not they
stand for any real values or could have been “pulled out of a hat.” Some
samples of measurements actually approach the chance condition just
implied. Others are not exactly “chance” collections of numbers, but
there is a strong element of chance involved in them. Conclusions to be
derived from the very same statistical results might differ considerably
whether we know the measurements to be highly reliable or not. Tests of
differences and correlation coefficients may often prove to be insignificant
merely because the measures used were lacking in reliability. Thus, the
matter of reliability well merits considerable attention.


RELIABILITY THEORY 


It is impossible to appreciate the many problems that arise in connection
with reliability and the several meanings of the term itself without
going into some of the mathematical ideas underlying the concept. The
reader will find that on the one hand there is a rigorously defined conception
of reliability from which it is possible to understand many of the
peculiarities of measurements, particularly those called test scores, and on
the other hand there are several operational conceptions of reliability,
depending upon how it is estimated from empirical data, such as internal-consistency,
test-retest, and alternate-forms methods. Keeping in mind


obtained, measure for the individual. The mean of these error components
is zero, as assumed above. Their variance is equal to 15.2.
The variance of the total measures can be estimated from the component
variances by using formula (16.17) of the preceding chapter. It
is merely the sum of the two component variances. In the new symbols,

$$\sigma_t^2 = \sigma_\infty^2 + \sigma_e^2$$   (Total variance as the sum of true and error variances) (17.2)


The application of this equation in Table 17.1 gives a total variance of
120.2, which checks with that computed from the sum of squares of $X_t$.
In satisfaction of the definition of reliability, we need to find the proportion
of total variance that is true variance. If we divide equation
(17.2) through by $\sigma_t^2$, we have proportions:

$$\frac{\sigma_t^2}{\sigma_t^2} = \frac{\sigma_\infty^2}{\sigma_t^2} + \frac{\sigma_e^2}{\sigma_t^2} = 1.00$$   (Sum of proportions of true and error variance) (17.3)
In symbolic form, the reliability of these measurements is given by the
ratio $\sigma_\infty^2/\sigma_t^2$, or in another form by $1 - \sigma_e^2/\sigma_t^2$. In other words, the
reliability is measured by the ratio of true variance to total variance, or
by one minus the ratio of error variance to total variance. Letting $r_{tt}$
stand for the coefficient of reliability, we have two alternative equations:

$$r_{tt} = \frac{\sigma_\infty^2}{\sigma_t^2}$$

and   (Basic equations for the coefficient of reliability) (17.4)

$$r_{tt} = 1 - \frac{\sigma_e^2}{\sigma_t^2}$$


For the problem of Table 17.1,

$$r_{tt} = \frac{105.0}{120.2} = .87$$

or

$$r_{tt} = 1 - \frac{15.2}{120.2} = 1 - .13 = .87$$


If we let $e^2$ stand for the proportion of error variance in the total, we
have the summational equation

$$r_{tt} + e^2 = 1.00$$   (Complementary nature of true and error variance) (17.5)
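
The arithmetic of equations (17.2) to (17.5) can be retraced with a few lines of Python. This is only a minimal sketch: the function name and its arguments are illustrative assumptions, and the two input values are the true and error variances quoted above for Table 17.1.

```python
def reliability_from_variances(true_var, error_var):
    """Return (total variance, r_tt, e^2) from the two variance components.

    Assumes, as the text does, that errors are uncorrelated with true
    scores, so that the component variances simply add (eq. 17.2).
    """
    total_var = true_var + error_var      # eq. (17.2)
    r_tt = true_var / total_var           # eq. (17.4), first form
    e2 = error_var / total_var            # proportion of error variance
    return total_var, r_tt, e2


# Values quoted in the text for Table 17.1.
total_var, r_tt, e2 = reliability_from_variances(105.0, 15.2)
print(round(total_var, 1), round(r_tt, 2), round(e2, 2))  # 120.2 0.87 0.13
```

The two proportions necessarily sum to 1.00, which is all that equation (17.5) asserts.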




The previous relationships are demonstrated pictorially in Fig. 17.1
and Fig. 17.2. In the first of these two, dispersions of true measures and
of total measures are shown. Both have the same mean. The standard
deviation $\sigma_t$ is greater than $\sigma_\infty$. This is always true, unless they happen
to be equal. The effect of errors of measurement is always to increase
obtained dispersions; never to decrease them, unless they should be correlated
with the true measures or with each other. This suggests that
standard errors of means and other statistics, which are estimated from
obtained $\sigma$’s, are inflated values when measures are at all unreliable.
Tests of significance are therefore reduced in power by unreliability. The
only remedy is to improve reliability of measures or to increase the size
of sample to compensate for errors of measurement. There are no known
corrections to apply, nor could they probably be justified.

Fig. 17.1.—Distribution of obtained scores in a test (solid curve) and of the hypothetical
true components of those scores (dotted curve). Means of obtained and true scores
coincide, on the assumption that errors of measurement have a mean of zero. The standard
deviation of the obtained scores is larger than that of their true components.


Fig. 17.2.—Amounts of true and error variance (first bar) in a test; also proportions of
true and error variance (second bar). (Bar labels: amounts of variances, with $\sigma_\infty^2 = 105.0$
marked as true variance; proportions of variances.)

Figure 17.2 presents the picture in a somewhat different manner. Here
the summative property of variances is apparent. Without the assumption
of zero correlations for the errors, such a simple picture would be
impossible. This kind of representation of variances, in tests particularly,
will be encountered with increasing frequency in this and the next
chapter.




The Index of Reliability.—The reliability coefficient for a test, $r_{tt}$, as
described thus far, is merely an abstract idea. Operationally, it is some
kind of self-correlation of a test, as most textbooks indicate. Before we
go into the various operations for estimating $r_{tt}$, let us add more fundamental
meaning to the idea of reliability. Let us think of the true score
($X_\infty$) and the obtained score ($X_t$) as being two separate variables, the one
dependent upon or predictable from the other. This is in spite of the fact
that the one includes the other. Think of $X_t$ as the dependent variable
and of $X_\infty$ as the independent variable. In a real sense $X_t$ is determined
by or dependent upon $X_\infty$. Figure 17.3 shows these two variables and
the line of regression of $X_t$ upon $X_\infty$. The correlation between the two
is $r_{t\infty}$. The square of a correlation coefficient is an index of determination.
It indicates the proportion of variance in $X_t$ that is determined by $X_\infty$.
But this is precisely what the reliability coefficient ($r_{tt}$) tells us.
Consequently, we have shown that

$$r_{t\infty}^2 = r_{tt}$$   (17.6)

and

$$r_{t\infty} = \sqrt{r_{tt}}$$   (Relation of index of reliability to a coefficient of reliability) (17.7)

Fig. 17.3.—Regression of obtained scores on true scores, with parallel lines drawn at
vertical distances of one standard error of an obtained score from the regression line.
(Compare with Fig. 15.6; the standard error of an obtained score is essentially a standard
error of estimate.) (Figure labels: obtained score ($X_t$); $\sigma_{t\infty}$, the standard error of an
obtained score; range within which two-thirds of obtained scores fall.)




The correlation between test scores and what they actually measure
($r_{t\infty}$) is called the index of reliability. Nothing can correlate with obtained
scores higher than their correlation with corresponding true scores. The
statistic $r_{t\infty}$, then, is often used as an indication of the higher limit of correlation
of any variable with another. Since $r_{t\infty}$ is the square root of the
reliability coefficient, it is always numerically higher than $r_{tt}$. Do not be
surprised, then, to find that a test may correlate higher with another than
it correlates with itself. We cannot compute $r_{t\infty}$ directly from data, but
it can be estimated from $r_{tt}$ or from other information. It is a seldom used
statistic, but has a definite meaning and could be used along with $r_{tt}$ or in
place of it.
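
To see how far the index of reliability stands above the reliability coefficient, relation (17.7) can be evaluated for a few values. The sketch below is illustrative only: the function name is an assumption, .87 is the Table 17.1 coefficient, and the other two coefficients are arbitrary examples.

```python
import math


def index_of_reliability(r_tt):
    """Index of reliability: the square root of the reliability
    coefficient, as in relation (17.7)."""
    if not 0.0 <= r_tt <= 1.0:
        raise ValueError("a reliability coefficient must lie in [0, 1]")
    return math.sqrt(r_tt)


# .87 is the Table 17.1 coefficient; .50 and .81 are arbitrary examples.
for r_tt in (0.50, 0.81, 0.87):
    print(r_tt, round(index_of_reliability(r_tt), 2))
```

For a coefficient of .81 the index is .90, so such a test could in principle correlate as high as .90 with some other variable even though its self-correlation is only .81.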

The Standard Error of an Obtained Score.—Since we can estimate the
correlation between obtained and true scores and can think in terms of
prediction of one from the other, we can also ask concerning the errors of
prediction. We know the obtained scores and from them could predict
true scores (assuming any mean and standard deviation we please for the
true-score scale). But there is nothing to be gained by so doing, for the
predictions would be no more accurate than the scores from which they
were obtained, and nothing would have happened except a change of unit
and zero point.

Suppose that we think in terms of prediction in the other direction; from
true scores to obtained scores. This is impossible, since we do not know
the true scores from which to make predictions. Let us think rather in
terms of determination; of true scores determining obtained scores. But
errors of measurement also help to determine obtained scores. We are
interested in the extent of the discrepancies caused by these errors of
measurement, in other words, in the size of distortions produced in the
otherwise true-determined measurements. The average of these discrepancies
is estimated by the formula

$$\sigma_{t\infty} = \sigma_t \sqrt{1 - r_{tt}}$$   (Standard error of an obtained measure) (17.8)

where $\sigma_t$ = standard deviation of the distribution of obtained scores.
$r_{tt}$ = reliability coefficient.

The standard error of an obtained score is a standard error of estimate
and may be interpreted as such.¹ Figure 17.3 shows the limits drawn
at distances of plus and minus one $\sigma_{t\infty}$ from the regression line. For a
test with a $\sigma_{t\infty}$ equal to 2.0 units, we may say that two-thirds of the obtained
scores are within 2.0 units of the true scores that determine them. If a
certain individual’s true score were 35, for example, the odds are 2 to 1
that his obtained score would not exceed 37 or fall below 33. Allowing
a margin of $2\sigma$, we can say that the odds are 19 to 1 that his obtained score
will not exceed 39 or fall below 31. Any obtained score does not tell us
what the corresponding true score is, but with knowledge of the $\sigma_{t\infty}$ we can
have a degree of confidence that the true score cannot be very far away.
The same standard error gives us some basis for confidence as to whether
the scores for two persons represent a real difference or whether we can
tolerate the idea that they could have come from the same true score.

¹ This statistic is frequently called the standard error of measurement.
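
Formula (17.8) and the bands quoted above can be checked numerically. The following is a minimal sketch rather than a prescribed procedure: both function names are illustrative assumptions, the variances are those of Table 17.1, and the true score of 35 with a standard error of 2.0 is the example from the text.

```python
import math


def standard_error_of_obtained_score(sd_obtained, r_tt):
    """Formula (17.8): sigma_t * sqrt(1 - r_tt)."""
    return sd_obtained * math.sqrt(1.0 - r_tt)


def score_band(true_score, se, k=1.0):
    """Limits k standard errors below and above a true score."""
    return true_score - k * se, true_score + k * se


# Table 17.1 values: total variance 120.2, true variance 105.0.
sd_t = math.sqrt(120.2)
r_tt = 105.0 / 120.2
se = standard_error_of_obtained_score(sd_t, r_tt)
print(round(se, 2))  # equals the square root of the error variance, 15.2

# The example in the text: true score 35, standard error 2.0.
print(score_band(35.0, 2.0))        # one-sigma band  (33.0, 37.0)
print(score_band(35.0, 2.0, k=2))   # two-sigma band  (31.0, 39.0)
```

The "two-thirds" and "19 to 1" statements attached to these bands are the usual normal-curve approximations for one and two standard errors.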
Reliability at Different Parts of the Test Scale-—Test users frequently 
ask to know the standard error of an obtained score rather than the 
reliability coefficient, because it tells them more directly what they wish 
to know. It tells them whether they should be concerned about differ- 
ences of 2, 4, 8, or 12 points or whether any or all of these differences are 
within the probable range that could have been produced by errors of 
measurement. It may happen, however, that because of a peculiarity of 
the test itself, discriminations are better at one part of the scale than at 
other parts. The $\sigma_{t\infty}$ statistic is a blanket index, implying approximately
equal discriminating power all along the scale. If there is reason to suspect
that discrimination is actually unequal along the scale, this can be
examined by preparing a scatter diagram, showing the relationship between
two forms (or halves) of the same test. The standard deviations of the
columns or rows at different score levels will indicate where predictions
have the greatest accuracy.
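
The column-by-column inspection of a scatter diagram can be imitated in code. The sketch below is only an illustration under stated assumptions: the function name, the class-interval width, and the eight score pairs are all hypothetical, and each pair is read as (score on form A, score on form B) for one examinee.

```python
from collections import defaultdict
from statistics import pstdev


def column_sds(pairs, width):
    """Standard deviation of form-B scores within each class interval
    of form-A scores (one 'column' of the scatter diagram)."""
    columns = defaultdict(list)
    for x, y in pairs:
        lower_limit = (x // width) * width   # lower limit of the interval
        columns[lower_limit].append(y)
    # A column with a single case has no dispersion to report.
    return {lo: pstdev(ys) for lo, ys in sorted(columns.items()) if len(ys) > 1}


# Hypothetical data, invented purely for illustration.
pairs = [(12, 11), (14, 15), (16, 13), (18, 19),
         (32, 31), (34, 35), (36, 36), (38, 37)]
for lower_limit, sd in column_sds(pairs, width=10).items():
    print(lower_limit, round(sd, 2))
```

A smaller column standard deviation at one score level than at another would suggest finer discrimination at that part of the scale. With real data the intervals should be narrow enough that the regression trend within a column does not inflate its dispersion.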


Computing the Standard Error of an Obtained Score from Differences.—
Rulon has devised a way of computing $\sigma_{t\infty}$ directly from differences
between scores made by individuals on odd and even pools of items.¹
The equation is

$$\sigma_{t\infty} = \sqrt{\frac{\Sigma d^2}{N}}$$   (Standard error of an obtained measure computed from differences)

where $d$ = difference between scores on the two half-tests for one individual.

The idea is simply that a difference between
one half-score and the other half-score for the same person is a measure
of the error for that individual. Since errors are conceived as deviations,
squaring, summing, and dividing by $N$ should estimate the amount of
error variance. That is precisely what $\sigma_{t\infty}^2$ signifies: the amount of error
variance. Thus, $\sigma_{t\infty}^2 = \sigma_e^2$. This fact will be used later as
another way of estimating the reliability coefficient.
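
Rulon's computation from differences is easily sketched in Python. The function names and the five pairs of half-scores below are hypothetical, chosen only for illustration. The last step, which turns the error variance into a reliability estimate, simply applies the second form of equation (17.4) to the total (odd plus even) scores.

```python
import math
from statistics import pvariance


def rulon_standard_error(odd_scores, even_scores):
    """Square root of sum(d^2) / N, where d is the odd-minus-even
    difference in half-scores for each individual."""
    if len(odd_scores) != len(even_scores) or not odd_scores:
        raise ValueError("need two equally long, non-empty score lists")
    n = len(odd_scores)
    sum_d2 = sum((o - e) ** 2 for o, e in zip(odd_scores, even_scores))
    return math.sqrt(sum_d2 / n)


def reliability_from_differences(odd_scores, even_scores):
    """r_tt = 1 - error variance / total variance, as in eq. (17.4)."""
    error_var = rulon_standard_error(odd_scores, even_scores) ** 2
    totals = [o + e for o, e in zip(odd_scores, even_scores)]
    return 1.0 - error_var / pvariance(totals)


# Hypothetical half-scores for five examinees.
odd = [10, 12, 9, 15, 11]
even = [9, 13, 11, 14, 11]
print(round(rulon_standard_error(odd, even), 2))
print(round(reliability_from_differences(odd, even), 2))
```

Here the differences are 1, -1, -2, 1, and 0, so the error variance is 7/5 = 1.4; the variance of the five total scores is 13.2, which gives a reliability estimate of about .89.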


¹ Rulon, P. J. A simplified procedure for determining the reliability of a test by
split-halves. Harv. educ. Rev., 1939, 9, 99-103.




METHODS OF ESTIMATING THE RELIABILITY COEFFICIENT


We leave theory for a while and see how $r_{tt}$ can be estimated from
empirical data. There are many procedures, falling roughly into the
three categories (1) internal-consistency reliability, or simply internal consistency;
(2) alternate-forms reliability, or comparable-forms reliability;
and (3) retest reliability, or test-retest reliability. Cronbach has recently
proposed that we speak of the second and third types of estimate as coefficients
of equivalence and of stability, respectively.¹ It would be convenient,
also, to speak of the first type as a coefficient of consistency.

There is no one best way of estimating $r_{tt}$. The type preferred will
depend upon one’s purposes and the meaning and use one wishes to attach
to $r_{tt}$. A secondary consideration is availability of data in the proper
form. Other considerations have to do with testing conditions and the
kind of test or other measure.

The nature of procedures differs most in the kind of things that are
allowed to be considered as true variance and as error variance. What
may be regarded as true variance in computing one kind of $r_{tt}$ may be
regarded as error variance in computing one of the others. For the sake
of clear thinking, it will pay us to look at some examples of this.

Contributors to True and Error Variance.—On the whole, things that
contribute to an examinee making the same score in “repeated” applications
of a test are contributors to true variance in the obtained scores.
The word “repeated” is in quotation marks because the repetition is
broadly defined to include alternate forms or two halves of the same test.
On the whole, things that contribute to different evaluations of performance
of an individual in a test are contributors to error variance. The
sources of either kind of variance are numerous. Certain of them are of
sufficient clarity and commonness of appearance to be recognized and
named.


¹ Cronbach, L. J. Test “reliability”: its meaning and determination. Psychom.,
1947, 12, 1-16.

Let the bar diagram in Fig. 17.4 represent the total variance in obtained
scores of a test. Let $c^2$ be that proportion of the total variance that would
be regarded as true variance no matter what method of estimating $r_{tt}$ is
employed. After all, they should have very much in common. Let $e_a^2$
be regarded as those sources of error variance that are unique to the
alternate-forms method but are regarded as sources of true variance for
the other methods. The relative sizes of these portions will vary from
test to test. Actual examples of $e^2$ and of $c^2$ will be given shortly. Let
$e_i^2$ be sources of error variance particularly when some internal-consistency
method is used. This portion is also represented as providing determiners
of errors for the retest method. Finally, let e?, be more distinctly the 
source of error when the retest method is applied, but as being a source of 
true variance for the other methods. The actual situation is probably not 
so simple as this, but it is hoped that this much simplicity will contribute 
to clear conceptions. 

Fig. 17.4.—Proportions of the total-score variance that can be regarded as true variance
or as error variance depending upon which type of reliability estimate is made. (Bar
diagram with spans marked for internal-consistency reliability, retest reliability, and
alternate-forms reliability.)

Now for some illustrations of actual determiners of the different kinds
of variance. These determiners, it must be remembered, are thought of
as producing individual differences between scores, either within a single
application of a test or between applications or between forms. Among
the determiners of individual differences that are consistent from time to
time and from one form of a test to another is individual status in some
enduring ability, skill, or other trait or traits. These are the things that
we wish to measure. Incidental determiners that also belong under portion
$c^2$ in the diagram (Fig. 17.4) are general skill in taking tests, skill in
taking this particular kind of test, including the form of item used, and
possibly the ability to understand test instructions. These additional


erate to affect variances, however, 
me directions in odd and even scores, 




as temperature, humidity, lighting, audibility of instructions or signals, 
ventilation, and the like, may differ enough to contribute to error vari- 
ance. There are probably more important changes in the examinee him- 
self. Having taken a certain test, he is not the same individual when 
faced with the second attempt. The skills and knowledge acquired during 
the first administration and in the interval between will have their effects 
upon the second performance. Memory for answers given on the first 
occasion may lead to repetitions of the same answers the second time and 
thus contribute to apparent true variance. Awareness of mistakes made 
in the first attempt, however, leads to changes in responses and hence to 
error variance. Besides possible improvement during the taking of the 
test the first time there is possible improvement resulting from transfer 
effects occurring during the interval between administrations. There are 
also possible maturational factors, particularly in young children. If 
learning and maturational effects were uniform for all individuals, or in 
proportion to their initial positions in the distribution, these determiners 
would not contribute to error variance. But to the extent that learning 
and maturational effects differ from person to person, they do add much 
to error variance. The longer the time interval between test administra- 
tion, the greater the error contributions. In some tests, continuous loss 
in reliability occurs as a function of time interval between test and retest. 
In some psychomotor tests, self correlations of .90 to .96 may be found by 
the odd-even method, but test-retest correlations with a year interval 
between may give correlations of approximately .70. Results of this kind 
were found in testing aviation cadets in the AAF before training and again 
after aircrew training and perhaps some combat.

Error variance in the alternate-forms method is contributed chiefly by 
the change in content of the test. Knowledge and skill for dealing with 
one particular set of items may vary somewhat from the knowledge and 
skill for dealing with another set of items, and these variations differ from 
person to person. In addition, depending upon the time interval between 
administrations of the two forms, some of the determiners of error vari- 
ance just mentioned for the retest method may also apply to the alternate-forms
method. An experiment in the AAF¹ in which the two forms were
given in immediate succession and also with four hours of other testing
intervening showed no appreciable change in the size of the self correlation.
Longer periods might well be expected to have some effect.

¹ Guilford, J. P. (Ed.) Printed classification tests, in AAF aviation psychology
research program reports, Report No. 5, Washington, D.C.: Government Printing
Office, 1947, 25f.

If the odd-even technique is used in the split-half method, the changes
in conditions that may occur during a single administration of a test are
rather uniformly distributed over all items in both halves so that their 
effects would not show up as error variance. There are other ways of 
splitting tests into halves, however, which may allow more error variance 
to creep in. If the test is divided by blocks of items, as in odd and even 
half-pages, or odd and even two-minute trials, or first half against second 
half, there is room for systematic shifting of conditions. The effects of 
learning, of temporary changes in mental set (as for speed versus accuracy 
or as to mode of attack on the items), or of fatigue or motivation, then 
might contribute to error variance. These are represented in section e? in 
Fig. 17.4. 

The determiners of error that would affect all methods of reliability 
estimate alike, represented by e°., are such phenomena as fluctuations of 
attention or memory or of motivation that occur from moment to moment 
or from item to item. On some tests, guessing is an important con- 


tributor to error variance. If a test is so difficult that everyone does 
considerable guessing (in the extreme ca 


guessed on every item) the total scores fo: 
distributions whose variances are very lar; 
is a feature in any test, the more difficult 
is likely to be. On the other hand, 
dispersion of scores and the lower th 
ber of alternative responses, 
feature. True-false tests of the sa 


which it will be used, 
Homogeneous versus Heterogeneous Tests.—Psychological tests can be
divided roughly into two classes: homogeneous and heterogeneous. The
former are functionally uniform or, strictly speaking, factorially unique.
They measure one ability or trait. Very few tests satisfy this definition
completely. Some examples are vocabulary, numerical-operations, and
perceptual-speed tests. The great
plex. Each one measures at the sa


test as a whole measures abilities P, Q, and R, and if each and every item 
also measures those three abilities, for operational purposes the test may 
be regarded as functionally homogeneous. An example of this would be 
an arithmetic-reasoning test or a figure-analogies test. We expect that 
homogeneous tests shall be internally consistent—we want all parts to 
measure the same thing; consequently, some form of internal-consistency 
index is called for. 

If a test is heterogeneous, in the sense that different parts measure
different traits, we would not expect a very high index of internal consistency.
An example of such a test is a biographical-data inventory.
This kind of test is composed of questions concerning the examinee’s previous
life and experiences. Each response to every item is usually validated
by correlating it with some practical criterion, e.g., success in pilot
training. The reason one response is valid is not necessarily the same as
the reason another is valid. They may both predict the criterion and
yet correlate zero with each other. The parts of such a test, one randomly
chosen half and another, will probably not correlate very high with each
other. The test has low internal consistency. An $r_{tt}$ computed in this
manner would not do justice to the test. Neither would an alternate-forms
$r_{tt}$, if the forms were developed independently. The only meaningful
estimate of reliability for a heterogeneous test is the retest variety.
If, by chance, a heterogeneous test were developed, each item of which
correlated with a criterion and yet did not correlate with any other item,
the internal-consistency reliability would be zero. Yet, the retest reliability
might be substantial or high. A biographical-data test of the type
referred to above had a characteristic split-half reliability coefficient of
about .35 and a retest reliability of about .65. Both of these values are
unusually low, but the test had a validity close to .40 for the selection of
pilots and consequently was very useful. Such a finding as this, incidentally,
is a dramatic demonstration of the fact that the requirement of
reliabilities of .90 and above is very unrealistic and that if a selection test
proves to be valid we can tolerate its low reliability. Standards for
evaluating both validity and reliability coefficients must be relative. No
universal limits can be applied.

It is clear from the discussion above that the internal consistency and
the stability of the same test need not agree very closely. There may be
very low internal consistency and yet substantial or high retest reliability.
It is probably not true, however, that there can be high internal consistency
and at the same time low retest reliability, except after very long
time intervals. If the two indices of reliability disagree for a test, we can
place some confidence in the inference that the test is heterogeneous.
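
The hypothetical extreme case mentioned above, items that each correlate with the criterion but not with one another, can be worked out from the standard formulas for the correlation of sums. The sketch below rests on simplifying assumptions that are not in the text: all items are taken as standardized, with one common item validity and one common inter-item correlation, and the function names and numerical values are purely illustrative.

```python
import math


def composite_validity(item_validity, n_items, inter_item_r):
    """Correlation of an unweighted sum of n standardized items with a
    criterion, given a common item validity and inter-item correlation."""
    variance_of_sum = n_items + n_items * (n_items - 1) * inter_item_r
    return n_items * item_validity / math.sqrt(variance_of_sum)


def half_test_correlation(n_items, inter_item_r):
    """Correlation between two halves of n/2 items each (n even)."""
    m = n_items // 2
    return m * inter_item_r / (1.0 + (m - 1) * inter_item_r)


# Sixteen items, each correlating .10 with the criterion (illustrative).
print(round(composite_validity(0.10, 16, 0.0), 2),
      round(half_test_correlation(16, 0.0), 2))    # uncorrelated items
print(round(composite_validity(0.10, 16, 0.10), 2),
      round(half_test_correlation(16, 0.10), 2))   # items correlating .10
```

Under these assumptions the completely heterogeneous set of sixteen items reaches a validity of .40 while its halves correlate zero, whereas the same items intercorrelating .10 give halves that correlate about .47 but a composite validity of only about .25.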


